Bug 132850
| Field | Value |
|---|---|
| Summary | add nscd support for initgroups() |
| Product | Red Hat Enterprise Linux 3 |
| Component | glibc |
| Version | 3.0 |
| Hardware | ia64 |
| OS | Linux |
| Status | CLOSED UPSTREAM |
| Severity | medium |
| Priority | medium |
| Reporter | XC Support <xc_support> |
| Assignee | Ulrich Drepper <drepper> |
| QA Contact | Brian Brock <bbrock> |
| CC | drepper, jakub |
| Keywords | FutureFeature |
| Doc Type | Enhancement |
| Last Closed | 2004-09-30 10:06:08 UTC |

Description
XC Support
2004-09-17 20:04:36 UTC
*** Bug 133116 has been marked as a duplicate of this bug. ***

I have further analyzed the problem and have determined its exact cause. I am hoping that Red Hat can provide a fix now that the cause is understood. The details are below.

Problem: Nested "dlopen()" calls from a statically built application will cause a segmentation fault.

Example: A statically built application a.out does a dlopen() of libfoo1.so. In turn, libfoo1.so does a dlopen() of libfoo2.so. The second dlopen(), the one for libfoo2.so, will cause a segmentation fault.

Cause: The segmentation fault occurs in the dynamic loader ld.so in the function _dl_catch_error() [elf/dl-error.c] due to an uninitialized function pointer GL(dl_error_catch_tsd) which, after macro expansion, is really _rtld_local._dl_error_catch_tsd [sysdeps/generic/ldsodefs.h]. Thus, the question becomes: why isn't GL(dl_error_catch_tsd) being initialized during the second dlopen()? Keep in mind that I'm picking on GL(dl_error_catch_tsd) because that is where the segmentation fault occurred. Other variables in the _rtld_local structure are likely to be uninitialized as well.

An explanation follows for both the statically built case, which crashes, and the dynamically built case, which works.

Application Built Statically (segmentation fault)
-------------------------------------------------

For libc.a, the GL(dl_error_catch_tsd) macro expands to the variable shown below [elf/dl-tsd.c]:

    # ifndef SHARED
    ...
    void **(*_dl_error_catch_tsd) (void) __attribute__ ((const))
         = &_dl_initial_error_catch_tsd;
    ...
    #endif

Thus, libc.a has an initialized copy of _dl_error_catch_tsd which points to the _dl_initial_error_catch_tsd routine.

    # nm -A /usr/lib64/libc.a | grep error_catch_tsd
    /usr/lib64/libc.a:dl-error.o:                 U _dl_error_catch_tsd
    /usr/lib64/libc.a:dl-tsd.o:0000000000000000 D _dl_error_catch_tsd
    /usr/lib64/libc.a:dl-tsd.o:0000000000000000 T _dl_initial_error_catch_tsd

Also in libc.a, the _dl_catch_error function is defined, which is the routine in which the segmentation fault occurs.

    # nm -A /usr/lib64/libc.a | grep dl_catch_error
    /usr/lib64/libc.a:dl-deps.o:                 U _dl_catch_error
    /usr/lib64/libc.a:dl-error.o:0000000000000000 T _dl_catch_error
    /usr/lib64/libc.a:dl-open.o:                 U _dl_catch_error
    /usr/lib64/libc.a:dl-libc.o:                 U _dl_catch_error

For libc.so, none of the symbols mentioned above are defined.

The a.out has the symbols because it was compiled with libc.a. Thus, the first call to dlopen( libfoo1.so ) resolves its symbols from the a.out address space. That is, it calls the _dl_catch_error routine in the a.out address space which, in turn, accesses the _dl_error_catch_tsd function pointer in the a.out address space. That pointer was initialized with the address of the _dl_initial_error_catch_tsd routine, which also exists in the a.out address space.

By the way, I know which address space things are coming from because I put "_dl_printf" statements in the "glibc" sources and compared the addresses printed at runtime with the addresses shown in "/proc/<pid>/maps".

The second call to dlopen( libfoo2.so ) tries to resolve its symbols from the ld.so (loader) address space.

Before I continue, let me say a few words about ld.so. During the compilation of the loader, the GL(dl_error_catch_tsd) macro expands to _rtld_local._dl_error_catch_tsd [sysdeps/generic/ldsodefs.h], a totally different variable than the one in libc.a.
That is, GL(dl_error_catch_tsd) expands to a different variable in libc.a than in ld.so, as can be seen in the code snippet below from "sysdeps/generic/ldsodefs.h":

    #ifndef SHARED
    # define EXTERN extern
    # define GL(name) _##name
    #else
    # define EXTERN
    # ifdef IS_IN_rtld
    #  define GL(name) _rtld_local._##name
    # else
    #  define GL(name) _rtld_global._##name
    # endif

As you can see, during the compilation of libc.a, which is NOT SHARED, GL(dl_error_catch_tsd) becomes _dl_error_catch_tsd. In the compilation of ld.so, GL(dl_error_catch_tsd) expands to _rtld_local._dl_error_catch_tsd. I mention this because we can't even think about using libc.a's object: the two are completely different.

Anyway, back to the second call to dlopen( libfoo2.so ). This is going to call the _dl_catch_error routine in ld.so's address space. The problem is that, for the loader, GL(dl_error_catch_tsd) gets initialized in dl_main [elf/rtld.c], but dl_main only gets called for shared applications, not during a dlopen. Therefore, GL(dl_error_catch_tsd) never gets initialized and, when it is referenced in _dl_catch_error [elf/dl-error.c], it contains a value of 0 (NULL pointer), which causes a segmentation fault.

So, why does the first dlopen( libfoo1.so ) execute routines in the a.out, while the second dlopen( libfoo2.so ) executes routines in ld.so? The reason is that when the a.out calls dlopen(), it uses the dlopen statically linked in from libdl.a. When the first library calls dlopen(), it gets resolved to the one in the pulled-in libdl.so. That's because the a.out does NOT have a ** dynamic symbol table ** (separate from externals and debug symbols), so the first library can't hook back to the dlopen() in the a.out. Thus it must use the one pulled in from libdl.so.

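The two expansions described above can be compared outside glibc. The following is a minimal sketch, not glibc code: `GL_STATIC`, `GL_RTLD`, `struct rtld_local_sketch`, and the helper functions are hypothetical names, and both expansions are placed in one file only so that they can be examined side by side. It is meant to show that the two spellings name unrelated objects, one with a static initializer and one that stays NULL until startup code assigns it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for _dl_initial_error_catch_tsd. */
static void *tsd_slot;
static void **initial_error_catch_tsd(void) { return &tsd_slot; }

/* libc.a case (SHARED not defined): GL(name) is _##name, a plain variable
   with a static initializer, as in the elf/dl-tsd.c excerpt. */
static void **(*_dl_error_catch_tsd)(void) = &initial_error_catch_tsd;

/* ld.so case (SHARED and IS_IN_rtld defined): GL(name) is a member of
   _rtld_local. It stays NULL until startup code (dl_main in the report)
   assigns it. This struct is a one-member stand-in for the real one. */
struct rtld_local_sketch { void **(*_dl_error_catch_tsd)(void); };
static struct rtld_local_sketch _rtld_local;

#define GL_STATIC(name) _##name
#define GL_RTLD(name)   _rtld_local._##name

int static_copy_is_set(void) { return GL_STATIC(dl_error_catch_tsd) != NULL; }
int rtld_copy_is_set(void)   { return GL_RTLD(dl_error_catch_tsd) != NULL; }

/* Self-check: the static copy is callable, the loader copy is still NULL. */
void check_expansions(void)
{
    assert(static_copy_is_set());
    assert(GL_STATIC(dl_error_catch_tsd)() == &tsd_slot);
    assert(!rtld_copy_is_set());
}
```

Calling through `GL_RTLD(dl_error_catch_tsd)` at this point would jump through a NULL pointer, which is the failure the report attributes to the second dlopen().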
Application Built With Shared Libraries (works)
-----------------------------------------------

In the case where the a.out is built with shared libraries, the loader's dl_main [elf/rtld.c] routine is called, which initializes GL(dl_error_catch_tsd). The variable is therefore properly initialized and we don't get a segmentation fault.

Conclusion
----------

One possible fix would be to put a check in either _dl_catch_error [elf/dl-error.c] or dlerror_run [elf/dl-libc.c] to see whether we are in the loader code and dl_main has NOT been called. If so, we need to initialize GL(dl_error_catch_tsd) and any other needed variables so that we don't get a segmentation fault due to uninitialized variables.

I will be adding a small reproducer for this problem shortly.

Rigoberto Corujo

Created attachment 104377 [details]
Reproducer for the problem where nested dlopen()'s cause segmentation fault
Untar this file and compile with the "compile.sh" script.
Set LD_LIBRARY_PATH to your working directory.
Run the "a.out"
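To make the check proposed in the conclusion above concrete, here is a minimal sketch of a guard that installs a fallback before the first call through the pointer. It is not glibc code and not the attached reproducer: `loader_catch_tsd`, `fallback_catch_tsd`, and `guarded_catch_tsd` are hypothetical names, and the real fix would also have to cover the other uninitialized loader variables the report mentions.

```c
#include <assert.h>
#include <stddef.h>

typedef void **(*catch_tsd_fn)(void);

/* Hypothetical fallback, playing the role of _dl_initial_error_catch_tsd. */
static void *fallback_slot;
static void **fallback_catch_tsd(void) { return &fallback_slot; }

/* Models the loader's pointer: NULL because the startup path never ran. */
static catch_tsd_fn loader_catch_tsd;

int loader_pointer_is_set(void) { return loader_catch_tsd != NULL; }

/* Calling loader_catch_tsd() unguarded would jump through NULL, which is
   the crash described in the report. The guard installs a fallback first. */
void **guarded_catch_tsd(void)
{
    void **slot;

    if (loader_catch_tsd == NULL)
        loader_catch_tsd = &fallback_catch_tsd;

    slot = loader_catch_tsd();
    assert(slot != NULL);
    return slot;
}
```

The guard costs one pointer comparison per call and leaves the pointer untouched when startup code has already set it.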
dlopen support in statically linked apps is very limited; it is not meant to be a general-purpose library loader for arbitrary libraries. Its role is just to support NSS modules (built against the same libc as the one they later run on). dlopen from within dlopened libraries is definitely not supported. If libnss_ldap.so.* calls dlopen, then the bug is in that library. For NSS purposes there is _dl_open_hook, through which libraries that call __libc_dlopen/__libc_dlsym/__libc_dlclose can use the loader in the statically linked binary.

Using any NSS functionality in statically linked applications is only supportable if nscd is used. Without nscd you are on your own. We will not and *can not* handle anything else. I don't think it makes any sense to keep this bug open. It is an installation problem if nscd is not running.

Ulrich,

Are you saying that "service nscd start" would prevent the segmentation fault from occurring? I just tried that with the initial reproducer that I provided (the one that calls initgroups()) and I get the same results (segmentation fault). Have you guys been successful in running my reproducer with nscd?

As a follow-up to Jakub's comment, I just want to add that it is actually "libsasl.a" that is doing the dlopen(). The "libnss_ldap.so" library links against "libldap.a". The "libldap.a" links against "libsasl.a".

If the solution to this problem is to run nscd, then so be it. But there must be more to it than that because, as I said before, I don't see a difference. I need some clarification, because I understood Jakub to mean that what was going on was illegal, but Ulrich seems to suggest that this should work as long as nscd is running.

Also, if dlopen'ing a shared library from a dlopen'ed library is not allowed, then it would be beneficial to put a check in "glibc" so that an error is returned to the calling dlopen() rather than letting a segmentation fault occur.

Rigoberto

> I just tried that with the initial
> reproducer that I provided (the one that calls initgroups()) and I
> get the same results (segmentation fault). Have you guys been
> successful in running my reproducer with nscd?
That is impossible unless the program cannot communicate with the nscd
and falls back on using NSS itself or you hit a different problem.
There has been at one point a change in the protocol but I don't think
there are any such binaries out there.
Run the program using strace and, if necessary, start nscd by hand and add
-d -d -d (three -d) to the command line. It won't fork then and will print
lots of information.
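Before reading strace output, it can help to confirm that a client could reach the daemon's socket at all. The sketch below only tries to connect to a Unix stream socket at a given path; `unix_socket_reachable` is a hypothetical helper, and the path `/var/run/nscd/socket` used in the note after the code is an assumption about the default location that should be checked against the installed nscd. A successful connect does not show that a particular libc call is routed through nscd.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns 1 if a stream connection to the Unix socket at `path` succeeds,
   0 otherwise (missing socket, no listener, path too long, and so on). */
int unix_socket_reachable(const char *path)
{
    struct sockaddr_un addr;
    int fd, ok;

    if (path == NULL || strlen(path) >= sizeof(addr.sun_path))
        return 0;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);  /* length was checked above */

    ok = connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0;
    close(fd);
    return ok;
}
```

For example, `unix_socket_reachable("/var/run/nscd/socket")` returning 0 would suggest that the fallback to in-process NSS modules described above is the path being taken.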
Ulrich,

I followed your instructions. Every time I run my "a.out" there is output from "nscd", so there is communication going on. The segmentation fault is still occurring. Can you confirm that you have indeed run my reproducer that calls initgroups() and have not had a segmentation fault?

The man page for "nscd" states that it is used to cache data. I'm not sure why running this daemon would solve my problem?

Rigoberto

> Can you confirm that you have indeed run my reproducer that calls
> initgroups() and have not had a segmentation fault?

Which reproducer calls initgroups? There is only one attachment, and it is code which uses dlopen() for purposes other than NSS. This is not supported. If it breaks, you keep the pieces.

Run your application which uses NSS and make sure there are no other dlopen calls in the statically linked code. Use strace to see what is going on.

> The man page for "nscd" states that it is used to cache data. I'm
> not sure why running this daemon would solve my problem?

It's not the caching part which is interesting here, it's the "nscd takes care of using the LDAP NSS module" part. All the statically linked application has to do is to communicate the request via a socket to nscd and receive the result. No NSS modules are involved on the client side. Which is why I say that if you still see NSS modules used, something is wrong.

One possibility is that you use services other than passwd, group, or hosts. Is this the case? These services are currently not supported in nscd. There is usually no need for this since plain files are enough (/etc/services etc. don't change).

So, please make sure your code does not use dlopen() for anything but NSS and that, after starting nscd, either nscd is used or only libnss_files is used.

Ulrich,

Either I'm misunderstanding you, you're misunderstanding me, or we're both misunderstanding each other. Please take a look at the very first entry I made to this bugzilla.
Would you please compile and run the code as I described and then tell me whether you see the same problem I'm seeing? This problem has nothing to do with any application that I'm writing. The second reproducer, which I had attached, was merely to show what is happening under the covers in an easy-to-understand way. The first reproducer, which I embedded directly into the text I entered, is at the heart of the problem. Please take a look at that and then we can continue our discussion.

Rigoberto

Why don't you just attach the data I'm looking for? Yes, your code uses initgroups, and this cannot fail if nscd is used. Which is why I ask for the strace output related to the initgroups call and the actual crash. Since I do not believe that you can continue to see the same crash with and without nscd (unless there is something broken in nscd), I also asked for other places where you might use dlopen (explicitly or implicitly). So, run strace.

FWIW, with a FC3t2 system I have no problem using the LDAP NSS module from the statically linked executable, but this is pure luck. What is important is that once nscd runs, no NSS module is used.

Created attachment 104426 [details]
output of the strace with the statically built a.out
The LDAP database contains only one user "johndoe" as well as the group
"johndoe". Running the "id johndoe" command verifies that communication with
the slapd server is good. The "nscd -d -d -d" is also running. Communication
with it also appears to be good. I will attach the output of "nscd -d -d -d"
shortly.
Created attachment 104427 [details]
output of the "nscd -d -d -d"
Comment on attachment 104427 [details]
output of the "nscd -d -d -d"
The "nscd -d -d -d" is started freshly. The "strace a.out" is immediately run.
The output of "nscd" is shown. The "a.out" is still getting a segmentation
fault.
I see what is going on. The initgroups calls do not try to use nscd at all but instead use the NSS modules directly. This is fatal in this situation. We might be able to get some code changes into one of the next RHEL3 updates, but there is not much we can do right now, except to question why you have to link statically. Static linking brings nothing but disadvantages.

Ulrich,

I, like you, work for support. You work for Red Hat support and I work for HP support. Our XC (Extreme Clusters) product is based on Red Hat Linux. One of our customers had asked us to document how to configure LDAP. While configuring LDAP, I found that "mysqld" did not start when LDAP was configured. After further analysis, I found that mysqld was linked statically and called initgroups(). To work around the mysqld problem we simply used a non-static version of mysqld. However, this was a concern to me because there may be other packages, or customer-written applications, which could potentially run into this problem. So, I had to get to the bottom of the situation and find out why statically built applications which call initgroups() would seg fault. This has led to the conversation that you and I have been having.

As you can see, it is not I who is developing statically linked applications, but I am concerned that customers who do develop statically linked applications and turn on LDAP may run into this problem. At the very least, for the short term, that second dlopen() should return an error and not seg fault. Maybe errno could be set to EPERM (operation not permitted) or something along those lines.

So, we are leaving this as a "to be fixed in a future release", correct?

Rigoberto

I'm reassigning this bug to glibc and have marked it as an enhancement. This is what it is: NSS simply isn't supported in statically linked applications. The summary has been changed to reflect the status.
If you are entitled to support for this kind of issue, you should bring it up with your Red Hat representative so that it can be added to IssueTracker. If you don't know what this is, then you are likely not entitled and you might want to consider getting appropriate service agreements.

> At the very least, for the short term, that second dlopen() should
> return an error and not seg fault.

No, since there are situations when it works. NSS in statically linked code is simply an "if it breaks you keep the pieces" thing: if it works you can be very happy, if not, you'll have to find another way. I cannot prevent people from having at least the opportunity to get it to work.

> So, we are leaving this as a "to be fixed in a future release",
> correct?

Yes. I'll keep this bug open so that once we have code for this, I can announce it. Whether we can use this code in future RHEL3 updates is another issue.

I added support for caching initgroups data in the current upstream glibc. Backporting the changes to RHEL3 is likely not going to happen, since the whole program has changed dramatically since the fork of the sources for RHEL3. If it is essential, contact your representative for support from Red Hat.

I am closing this bug since the improvement has been implemented.