Bug 132850

Summary: add nscd support for initgroups()
Product: Red Hat Enterprise Linux 3
Component: glibc
Version: 3.0
Hardware: ia64
OS: Linux
Status: CLOSED UPSTREAM
Severity: medium
Priority: medium
Reporter: XC Support <xc_support>
Assignee: Ulrich Drepper <drepper>
QA Contact: Brian Brock <bbrock>
CC: drepper, jakub
Keywords: FutureFeature
Doc Type: Enhancement
Last Closed: 2004-09-30 10:06:08 UTC
Attachments:
  Reproducer for the problem where nested dlopen()'s cause segmentation fault (flags: none)
  output of the strace with the statically built a.out (flags: none)
  output of the "nscd -d -d -d" (flags: none)

Description XC Support 2004-09-17 20:04:36 UTC
Description of problem:

When you set "group: files ldap" in "/etc/nsswitch.conf" and you have 
a statically built application, a call to "initgroups()" causes a 
segmentation fault.

Version-Release number of selected component (if applicable):

glibc-2.3.2-95.20

How reproducible:


Steps to Reproduce:
1. Set "group: files ldap" in "/etc/nsswitch.conf"
2. Use the following reproducer program. The user is "mysql", but you 
can choose another.

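#include <stdio.h>
#include <grp.h>
#include <pwd.h>
#include <sys/types.h>

int main(void)
{
       struct passwd *pw_ptr;
       const char *user = "mysql";

       pw_ptr = getpwnam(user);
       if (pw_ptr == NULL) {   /* user not found or lookup failed */
              fprintf(stderr, "getpwnam(%s) failed\n", user);
              return 1;
       }

       printf("pw_ptr->pw_gid = %d\n", (int) pw_ptr->pw_gid);

       /* Crashes here when built with -static and "group: files ldap" */
       if (initgroups(user, pw_ptr->pw_gid) != 0)
              perror("initgroups");
       return 0;
}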

3. Compile with "cc filename.c -static"
  
4. Run "a.out".

Actual results:

# ./a.out
pw_ptr->pw_gid = 101
Segmentation fault

Expected results:

# ./a.out
pw_ptr->pw_gid = 101


Additional info:

This only happens when compiling with "-static".

Comment 1 XC Support 2004-09-21 19:14:23 UTC
*** Bug 133116 has been marked as a duplicate of this bug. ***

Comment 2 XC Support 2004-09-27 15:04:30 UTC
I have further analyzed the problem and have determined its exact 
cause.  I am hoping that Red Hat can provide a fix for this problem 
now that its cause is understood.  The details are below.


Problem: Nested "dlopen()" calls from a statically built application 
will cause a segmentation fault.

Example: A statically built application a.out does a dlopen() of 
libfoo1.so. In turn, libfoo1.so does a dlopen() of libfoo2.so. The 
second dlopen(), which loads libfoo2.so, will cause a segmentation fault.

Cause: The segmentation fault occurs in the dynamic loader ld.so in 
the function _dl_catch_error() [elf/dl-error.c] due to an 
uninitialized function pointer GL(dl_error_catch_tsd) which, after 
macro expansion, is really _rtld_local._dl_error_catch_tsd 
[sysdeps/generic/ldsodefs.h].  Thus, the question becomes, why isn't 
GL(dl_error_catch_tsd) being initialized during the second dlopen()?  
Keep in mind that I'm picking on GL(dl_error_catch_tsd) because that 
is where the segmentation fault occurred.  There are likely other 
variables in the _rtld_local structure that may be uninitialized as well.

An explanation follows for both the statically built case, which 
crashes, and the dynamically built case, which works.

Application Built Statically (segmentation fault)
-------------------------------------------------

For libc.a, the GL(dl_error_catch_tsd) macro expands to the variable 
shown below [elf/dl-tsd.c]

# ifndef SHARED
...
    void **(*_dl_error_catch_tsd) (void) __attribute__ ((const)) =
        &_dl_initial_error_catch_tsd;
...

#endif
 
Thus, libc.a has an initialized copy of _dl_error_catch_tsd which 
points to the _dl_initial_error_catch_tsd routine.

# nm -A /usr/lib64/libc.a | grep error_catch_tsd

/usr/lib64/libc.a:dl-error.o:               U _dl_error_catch_tsd
/usr/lib64/libc.a:dl-tsd.o:0000000000000000 D _dl_error_catch_tsd
/usr/lib64/libc.a:dl-tsd.o:0000000000000000 T _dl_initial_error_catch_tsd

Also in libc.a, the _dl_catch_error function is defined, which is the 
routine in which the segmentation fault occurs.

# nm -A /usr/lib64/libc.a | grep dl_catch_error

/usr/lib64/libc.a:dl-deps.o:                  U _dl_catch_error
/usr/lib64/libc.a:dl-error.o:0000000000000000 T _dl_catch_error
/usr/lib64/libc.a:dl-open.o:                  U _dl_catch_error
/usr/lib64/libc.a:dl-libc.o:                  U _dl_catch_error

For libc.so, none of the symbols mentioned above are defined.

The a.out has the symbols because it was compiled with libc.a.

Thus, the first call to dlopen( libfoo1.so ) resolves its symbols 
from the a.out address space.  That is, it calls the _dl_catch_error 
routine in the a.out address space which, in turn, accesses the 
_dl_error_catch_tsd function pointer in the a.out address space which 
was initialized with the address of the _dl_initial_error_catch_tsd 
routine, which also exists in the a.out address space.

By the way, the reason I know what address space things are coming 
from is because I put "_dl_printf" statements in the "glibc" sources 
and compared the addresses that were printed at runtime with the 
addresses shown in "/proc/<pid>/maps".
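
The address comparison described above can be automated. The following is a minimal sketch, assuming a Linux /proc filesystem; the function name find_mapping is made up for this example. It looks up the line of /proc/self/maps whose range contains a given address, which shows whether an object lives in the a.out, in ld.so, or in some other mapping.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Search /proc/self/maps for the mapping that contains addr.
   If out is non-NULL, the matching line is copied into it.
   Returns 1 if a mapping was found, 0 otherwise. */
int find_mapping(const void *addr, char *out, size_t outlen)
{
    uintptr_t target = (uintptr_t) addr;
    char line[512];
    int found = 0;
    FILE *fp = fopen("/proc/self/maps", "r");

    if (fp == NULL)
        return 0;

    while (!found && fgets(line, sizeof line, fp) != NULL) {
        unsigned long start, end;

        /* Each line begins with "start-end" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &start, &end) != 2)
            continue;
        if (target >= start && target < end) {
            if (out != NULL && outlen > 0) {
                strncpy(out, line, outlen - 1);
                out[outlen - 1] = '\0';
            }
            found = 1;
        }
    }
    fclose(fp);
    return found;
}
```

Passing the address of a variable or string literal together with a buffer yields the full maps line, including the path of the object that backs the mapping.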

The second call to dlopen( libfoo2.so ) tries to resolve its symbols 
from the ld.so (loader) address space. 

Before I continue, let me say a few words about ld.so.  During the 
compilation of the loader, the GL(dl_error_catch_tsd) macro expands 
to _rtld_local._dl_error_catch_tsd [sysdeps/generic/ldsodefs.h], a 
totally different variable than the one in libc.a.  That is, 
GL(dl_error_catch_tsd) expands to a different variable in libc.a than 
in ld.so, as can be seen in the code snippet shown below 
from "sysdeps/generic/ldsodefs.h":

#ifndef SHARED
# define EXTERN extern
# define GL(name) _##name
#else
# define EXTERN
# ifdef IS_IN_rtld
# define GL(name) _rtld_local._##name
# else
# define GL(name) _rtld_global._##name
# endif
 
As you can see, during the compilation of libc.a, which is NOT 
SHARED, GL(dl_error_catch_tsd) becomes _dl_error_catch_tsd.  In the 
compilation of ld.so, GL(dl_error_catch_tsd) expands to 
_rtld_local._dl_error_catch_tsd.  The reason I mention this is 
that the loader cannot simply use libc.a's objects, because the two 
variables are completely different.
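
To see the two expansions side by side, the macro selection can be mimicked in a small standalone C file. This is only an illustrative sketch: the GL_STATIC and GL_RTLD names and the helper functions are invented here so that both variants can coexist in one translation unit; glibc itself defines a single GL() chosen by SHARED and IS_IN_rtld.

```c
#include <stdio.h>
#include <string.h>

/* Two-step stringification so the argument is macro-expanded first. */
#define STR2(x) #x
#define STR(x)  STR2(x)

/* Hypothetical stand-ins for the two GL() definitions quoted above. */
#define GL_STATIC(name) _##name               /* libc.a: SHARED not defined */
#define GL_RTLD(name)   _rtld_local._##name   /* ld.so: SHARED, IS_IN_rtld  */

/* Name that GL(dl_error_catch_tsd) refers to in a static libc build. */
const char *gl_static_name(void)
{
    return STR(GL_STATIC(dl_error_catch_tsd));
}

/* Name that the same macro call refers to inside the loader. */
const char *gl_rtld_name(void)
{
    return STR(GL_RTLD(dl_error_catch_tsd));
}

/* Returns 1 if the two builds refer to differently named objects. */
int gl_names_differ(void)
{
    return strcmp(gl_static_name(), gl_rtld_name()) != 0;
}

/* Print both expansions for comparison. */
void print_gl_expansions(void)
{
    printf("libc.a: %s\n", gl_static_name());
    printf("ld.so:  %s\n", gl_rtld_name());
}
```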

Anyway, back to the second call to dlopen( libfoo2.so ).  This is 
going to call the _dl_catch_error routine in ld.so's address 
space.  The problem is that, for the loader, GL(dl_error_catch_tsd) 
gets initialized in dl_main [elf/rtld.c], but dl_main only gets 
called for shared applications, not during a dlopen.  Therefore, 
GL(dl_error_catch_tsd) never gets initialized and, when it is 
referenced in _dl_catch_error [elf/dl-error.c], it contains the value 
0 (a NULL pointer), which causes a segmentation fault.

So, why does the first dlopen( libfoo1.so ) execute routines in the 
a.out, while the second dlopen( libfoo2.so ) executes routines in 
ld.so? 

The reason is that when the a.out calls dlopen() it uses the dlopen 
statically linked in from libdl.a.  When the first library calls 
dlopen() it gets resolved to the one in the pulled-in libdl.so.  
That's because the a.out does NOT have a ** dynamic symbol table ** 
(separate from externals and debug symbols) so the first library 
can't hook back to the dlopen() in the a.out.  Thus it must use the 
one pulled in from libdl.so.


Application Built With Shared Libraries (works)
-----------------------------------------------

In the case where the a.out is built with shared libraries, the 
ld.so's (loader) dl_main [elf/rtld.c] routine is called which will 
initialize GL(dl_error_catch_tsd), so we don't get a segmentation 
fault since the variable is properly initialized.

Conclusion
----------

One possible fix would be to put a check in either _dl_catch_error 
[elf/dl-error.c] or dlerror_run [elf/dl-libc.c] to see if we are in 
the loader code and if dl_main has NOT been called.  If we are in the 
loader code and dl_main has not been called, then we need to 
initialize GL(dl_error_catch_tsd) and other needed variables so that 
we don't get a segmentation fault due to uninitialized variables.
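
The check proposed above can be illustrated in isolation. The following is a minimal sketch, not glibc code: all names (catch_tsd, initial_catch_tsd, get_catch_tsd_slot) are hypothetical stand-ins for GL(dl_error_catch_tsd), _dl_initial_error_catch_tsd and the call site in _dl_catch_error, and a real fix would also have to cover the other loader variables that dl_main normally sets up.

```c
#include <stddef.h>

/* Storage handed out by the fallback routine. */
static void *initial_slot;

/* Hypothetical counterpart of _dl_initial_error_catch_tsd. */
static void **initial_catch_tsd(void)
{
    return &initial_slot;
}

/* Zero-initialized function pointer, as in a loader whose dl_main never ran. */
static void **(*catch_tsd)(void);

/* Guarded call site: install the fallback before the first indirect call
   instead of jumping through a NULL pointer. */
void **get_catch_tsd_slot(void)
{
    if (catch_tsd == NULL)
        catch_tsd = initial_catch_tsd;
    return catch_tsd();
}
```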

I will be adding a small reproducer for this problem shortly.

Rigoberto Corujo


Comment 3 XC Support 2004-09-27 15:08:36 UTC
Created attachment 104377 [details]
Reproducer for the problem where nested dlopen()'s cause segmentation fault

Untar this file and compile with the "compile.sh" script.

Set LD_LIBRARY_PATH to your working directory.

Run the "a.out"

Comment 4 Jakub Jelinek 2004-09-27 15:32:04 UTC
dlopen support in statically linked apps is very limited and is not
meant to be a general-purpose library loader for any kind of library.
Its role is just to support NSS modules (built against the same
libc that they later run on).
dlopen from within dlopened libraries is definitely not supported.

If libnss_ldap.so.* calls dlopen, then the bug is in that library.

For NSS purposes there is _dl_open_hook through which libraries
that call __libc_dlopen/__libc_dlsym/__libc_dlclose can use the
loader in the statically linked binary.

Comment 5 Ulrich Drepper 2004-09-27 15:44:06 UTC
Using any NSS functionality in statically linked applications is only
supportable if nscd is used.  Without nscd you are on your own.  We
will not and *can not* handle anything else.

I don't think it makes any sense to keep this bug open.  It is an
installation problem if nscd is not running.

Comment 6 XC Support 2004-09-27 18:15:08 UTC
Ulrich,

Are you saying that "service nscd start" would prevent the 
segmentation fault from occurring?  I just tried that with the initial 
reproducer that I provided (the one that calls initgroups()) and I 
get the same results (segmentation fault).  Have you guys been 
successful in running my reproducer with nscd?

As a follow-up to Jakub's comment, I just want to add that it is 
actually "libsasl.a" that is doing the dlopen().  
The "libnss_ldap.so" library links against "libldap.a".  
The "libldap.a" links against "libsasl.a".

If the solution to this problem is to run nscd, then so be it.  But, 
there must be more to it than that because, like I said before, I 
don't see a difference.  I need some clarification, because I 
understood Jakub to mean that what was going on was illegal but 
Ulrich seems to suggest that this should work as long as nscd is 
running.

Also, if dlopen'ing a shared library from a dlopen'ed library is not 
allowed, then it would be beneficial to put a check in "glibc" so 
that an error is returned to the calling dlopen() rather than letting 
a segmentation fault occur.

Rigoberto

Comment 7 Ulrich Drepper 2004-09-27 20:17:24 UTC
> I just tried that with the initial 
> reproducer that I provided (the one that calls initgroups()) and I 
> get the same results (segmentation fault).  Have you guys been 
> successful in running my reproducer with nscd?

That is impossible unless the program cannot communicate with the nscd
and falls back on using NSS itself or you hit a different problem. 
There has been at one point a change in the protocol but I don't think
there are any such binaries out there.

Run the program using strace, and possibly start nscd by hand and add
-d -d -d (three -d) to the command line.  It then won't fork and will
spit out lots of information.


Comment 8 XC Support 2004-09-27 21:00:50 UTC
Ulrich,

I followed your instructions.  Every time I run my "a.out" there is 
output from "nscd", so there is communication going on.  The 
segmentation fault is still occurring.

Can you confirm that you have indeed run my reproducer that calls 
initgroups() and have not had a segmentation fault?

The man page for "nscd" states that it is used to cache data.  I'm 
not sure why running this daemon would solve my problem?

Rigoberto

Comment 9 Ulrich Drepper 2004-09-27 23:00:49 UTC
> Can you confirm that you have indeed run my reproducer that calls 
> initgroups() and have not had a segmentation fault?

Which reproducer calls initgroups?  There is only one attachment
and this is code which uses dlopen() for other purposes than NSS. 
This is not supported.  If it breaks, you keep the pieces.


Run your application which uses NSS and make sure there are no other
dlopen calls in the statically linked code.  Use strace to see what is
going on.


> The man page for "nscd" states that it is used to cache data.  I'm 
> not sure why running this daemon would solve my problem?

It's not the caching part which is interesting here, it's the "nscd
takes care of using the LDAP NSS module" part.  All the statically
linked application has to do is to communicate the request via a
socket to nscd and receive the result.  No NSS modules involved on the
client side.  Which is why I say that if you still see NSS modules
used, something is wrong.
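
Whether a client can reach nscd at all can be checked independently of NSS. The sketch below only tries to connect to the daemon's UNIX domain socket; it does not speak the nscd request protocol. The path /var/run/nscd/socket is an assumption based on the usual glibc default and may differ on a given installation, and the function name nscd_socket_reachable is invented for this example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Assumed default location of the nscd socket; adjust for your system. */
#define NSCD_SOCKET_PATH "/var/run/nscd/socket"

/* Return 1 if a stream connection to the UNIX socket at path succeeds,
   0 otherwise (daemon not running, wrong path, permission problem, ...). */
int nscd_socket_reachable(const char *path)
{
    struct sockaddr_un addr;
    int fd, ok;

    if (path == NULL || strlen(path) >= sizeof addr.sun_path)
        return 0;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);   /* length checked above */

    ok = connect(fd, (struct sockaddr *) &addr, sizeof addr) == 0;
    close(fd);
    return ok;
}
```

Calling nscd_socket_reachable(NSCD_SOCKET_PATH) shows whether the daemon is listening. It says nothing about whether a particular libc call actually routes its request through nscd.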

One possibility is that you use services other than passwd, group, or
hosts.  Is this the case?  These services are currently not supported
in nscd.  There is usually no need for this since plain files are
enough (/etc/services etc don't change).

So, please make sure your code does not use dlopen() for anything but
NSS and that after starting nscd either it is used or only
libnss_files is used.

Comment 10 XC Support 2004-09-27 23:27:53 UTC
Ulrich,

Either I'm misunderstanding you, you're misunderstanding me, or we're 
both misunderstanding each other.  Please take a look at the very 
first entry I made to this bugzilla.  Would you please compile and 
run the code as I described and then tell me whether you see the same 
problem I'm seeing?  This problem has nothing to do with any 
application that I'm writing.  The second reproducer, which I had 
attached, was merely to show what is happening under the covers in an 
easy to understand way.  The first reproducer, which I embedded 
directly into the text I entered, is at the heart of the problem.  
Please take a look at that and then we can continue our discussion.

Rigoberto

Comment 11 Ulrich Drepper 2004-09-28 00:02:16 UTC
Why don't you just attach the data I'm looking for?  Yes, your code
uses initgroups and this cannot fail if nscd is used.  Which is why I
ask for the strace output related to the initgroups call and the
actual crash.

Since I do not believe that you can continue to see the same crash
with and without nscd (unless there is something broken in nscd) I
also asked for other places you might use dlopen (explicitly or
implicitly).

So, run strace.

FWIW, with a FC3t2 system I have no problem using the LDAP NSS module
from the statically linked executable but this is pure luck.  What is
important is that once nscd runs no NSS module is used.

Comment 12 XC Support 2004-09-28 12:28:07 UTC
Created attachment 104426 [details]
output of the strace with the statically built a.out

The LDAP database contains only one user "johndoe" as well as the group
"johndoe".  Running the "id johndoe" command verifies that communication with
the slapd server is good.  The "nscd -d -d -d" is also running.  Communication
with it also appears to be good.  I will attach the output of "nscd -d -d -d"
shortly.

Comment 13 XC Support 2004-09-28 12:28:41 UTC
Created attachment 104427 [details]
output of the "nscd -d -d -d"

Comment 14 XC Support 2004-09-28 12:31:51 UTC
Comment on attachment 104427 [details]
output of the "nscd -d -d -d"

The "nscd -d -d -d" is started freshly.  The "strace a.out" is immediately run.
The output of "nscd" is shown.  The "a.out" is still getting a segmentation
fault.

Comment 15 Ulrich Drepper 2004-09-28 18:13:09 UTC
I see what is going on.  The initgroups calls do not try to use nscd at
all but instead use the NSS modules directly.  This is fatal in this
situation.

We might be able to get some code changes into one of the next RHEL3
updates but there is not much we can do right now.  Except questioning
why you have to link statically.  It has nothing but disadvantages.


Comment 16 XC Support 2004-09-28 18:29:38 UTC
Ulrich,

I, like you, work for support.  You work for Red Hat support and I 
work for HP support.  Our XC (Extreme Clusters) product is based on 
Red Hat Linux.  One of our customers had asked us to document how to 
configure LDAP.  While configuring LDAP, I found that "mysqld" did 
not start when LDAP was configured.  After further analysis, I found 
that mysqld was linked statically and called initgroups().  To work 
around the mysqld problem we simply used a non-static version of 
mysqld.  However, this was a concern to me because there may be other 
packages, or customer written applications, which could potentially 
run into this problem.  So, I had to get to the bottom of the 
situation and find out why statically built applications which called 
initgroups() would seg fault.  This has led to this conversation that 
you and I have been having.  As you can see, it is not I who is 
developing statically linked applications, but I am concerned that 
customers who do develop statically linked applications and turn on 
LDAP may run into this problem.

At the very least, for the short term, that second dlopen() should 
return an error and not seg fault.  Maybe errno could be set to EPERM 
(operation not permitted) or something along those lines.
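
The behavior requested here, failing cleanly instead of crashing, might look like the following sketch. It is not glibc code: loader_initialized, guarded_open, open_fn and mark_loader_initialized are hypothetical names, and EPERM is simply the errno value suggested above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag that a loader would set once its start-up code ran. */
static int loader_initialized = 0;

/* Type of the underlying open routine being protected. */
typedef void *(*open_fn)(const char *path);

/* Refuse the request with EPERM while the loader state is unusable;
   otherwise forward it to real_open. */
void *guarded_open(open_fn real_open, const char *path)
{
    if (!loader_initialized || real_open == NULL) {
        errno = EPERM;
        return NULL;
    }
    return real_open(path);
}

/* Mark the loader state as ready (stand-in for what dl_main does). */
void mark_loader_initialized(void)
{
    loader_initialized = 1;
}
```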

So, we are leaving this as a "to be fixed in a future release", 
correct?

Rigoberto 

Comment 17 Ulrich Drepper 2004-09-28 18:36:13 UTC
I'm reassigning this bug to glibc and marking it as an enhancement. 
That is what it is: NSS simply isn't supported in statically linked
applications.  The summary has been changed to reflect the status.

If you are entitled to support for this kind of issue you should
bring it up with your Red Hat representative so that it can be
added to IssueTracker.  If you don't know what this is then you are
likely not entitled and you might want to consider getting appropriate
service agreements.

Comment 18 Ulrich Drepper 2004-09-28 18:43:09 UTC
> At the very least, for the short term, that second dlopen() should 
> return an error and not seg fault.

No, since there are situations when it works.  NSS in statically
linked code is simply an "if it breaks you keep the pieces" thing: if
it works you can be very happy, if not, you'll have to find another
way.  I cannot prevent people from having at least the opportunity to
get it to work.


> So, we are leaving this as a "to be fixed in a future release", 
> correct?

Yes.  I'll keep this bug open so that once we have code for this, I
can announce it.  Whether we can use this code in future RHEL3
updates is another issue.

Comment 19 Ulrich Drepper 2004-09-30 10:06:08 UTC
I added support for caching initgroups data in the current upstream
glibc.  Backporting the changes to RHEL3 is likely not going to happen
since the whole program changed dramatically since the fork of the
sources for RHEL3.  If it is essential, contact your representative
for support from Red Hat.  I am closing this bug since the improvement has
been implemented.