Bug 1621927

Summary: glibc: [RFE][LLNL 7.7 Bug] Implement RTLD_PARENT for glibc.
Product: Red Hat Enterprise Linux 8
Component: glibc
Version: 8.2
Status: CLOSED UPSTREAM
Severity: low
Priority: unspecified
Reporter: Ben Woodard <woodard>
Assignee: glibc team <glibc-bugzilla>
QA Contact: qe-baseos-tools-bugs
CC: ashankar, codonell, dj, fweimer, mgrondona, mnewsome, pfrankli, tgummels, woodard
Keywords: FutureFeature, Triaged
Target Milestone: rc
Target Release: 8.2
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-01-20 14:45:23 UTC
Type: Bug
Bug Blocks: 1599298
Attachments: reproducer. (flags: none)

Description Ben Woodard 2018-08-23 22:24:27 UTC
Created attachment 1478359: reproducer.

Description of problem:
If you dlopen a library with RTLD_NOW|RTLD_LOCAL (main.c) it puts the library in its own local linkmap. The libraries required by this library and its dependencies are also loaded into this namespace and the symbols used by them are satisfied by searching this namespace's linkmap.

However, if one of those required libraries dlopen's a library then the local linkmap from which this library is searching is not searched.
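
A minimal sketch of the call pattern described above, not the attached reproducer itself. The helper name load_local and the way it reports errors are assumptions for illustration; only the dlopen() flags come from this report.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper: open `path` with the flags main.c uses.
 * RTLD_LOCAL keeps the library's symbols out of the global scope, so a
 * library opened later does not see them unless it names this one as a
 * dependency.
 */
void *load_local(const char *path)
{
    assert(path != NULL);

    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "%s\n", err != NULL ? err : "dlopen failed");
    }
    return handle;
}
```

In the reproducer's terms, main would call something like load_local() on liba.so, and libe, pulled in through liba's dependencies, would later make the same kind of call on libd.so. That second call is the one that fails with the undefined symbol.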

Version-Release number of selected component (if applicable):
glibc-2.17-222.el7.x86_64.rpm
but the problem appears to persist all the way up to the latest glibc in f28, glibc-2.27-30.fc28.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. download the reproducer
2. mkdir /tmp/test
3. ./configure --prefix=/tmp/test
4. make
5. make install
6. ./main

Actual results:
[ben@Mustang DL-link]$ ./main 
Lib: /home/ben/Work/DL-link/test/lib/liba.so
Starting liba
Starting libb
Starting libe
libe.la: /home/ben/Work/DL-link/test/lib/libd.so: undefined symbol: libe_func

Expected results:
It can find the libe_func

Additional info:
This has come up a couple of times with lua. In the reproducer liblua.so is represented by libe.so and one of its modules happens to be cpuset.so which is represented by libd.so

One possible workaround is to add a dependency from the Lua module back on the liblua.so core library, by uncommenting this line in Makefile.am:
#libd_la_LIBADD = libe.la

Doing this fixes the problem. However, it is considered bad practice; see the section titled "Do Not Link Modules to the Lua Core Libraries" at http://lua-users.org/wiki/BuildingModules:

"In case you built a shared library containing the Lua core (*), please do not link any modules against it, too. I.e. do not specify the name of the Lua core library on the linker line (Windows DLLs are an exception). This makes a hard forward dependency where you really want a lazy backward dependency. Loading this module with a statically linked Lua interpreter would essentially drag in a 2nd Lua core and you get the same problems mentioned above.

"Lua modules expect that the Lua core is already present before they are loaded. This works because the Lua core symbols are exported globally (e.g. with -Wl,-E for GCC on most ELF systems -- see the Lua Makefiles).

All the Lua modules that we ship with the distro follow this guideline. This leads me to believe that the underlying problem is that the search scope used by the dynamic linker is not correct in this case. It should refer back to the local linkmap from which the call is being made rather than searching only the global linkmap.

On the other hand, Python site packages appear to link against their Python libraries, so it could easily be that the problem is the advice given by the Lua developers, and also their packaging.

Since this has come up a couple of times, the correct behavior should be clearly documented.

Comment 2 Ben Woodard 2018-08-23 22:42:44 UTC
This problem has also cropped up when writing custom pam modules.

The original problem report was (in hopes that this makes it easier to understand):

Got a DSO problem that I think there *must* be a better way to solve.
I have a dlopened module in a main program that itself uses a library which links against Lua.
The library is used to open Lua scripts which serve as configuration. The Lua script calls Lua's `require` function which itself dlopens a C Lua module
That Lua C module gets an error from ld.so "can't find symbol lua_gettop" which is a symbol from liblua.so
liblua is linked to the library which is loading the lua script
The only way around this I've found so far is to dlopen(liblua.so) from the module of the first part with RTLD_GLOBAL to force the liblua symbols global for the program so that they are visible to libraries it dlopens
seems like there should be a simpler way

I've run into this problem in the past and only figured it out far enough to use the dlopen() trick
If having real program names helps, it is flux-broker->dlopen("sched.so")->links_with("librdl.so")->lua_loadfile ("rdl.lua")->dlopen("cpuset.so")->"undefined symbol lua_gettop"
librdl.so is linked with liblua.so
when librdl is used outside of a dlopened module, symbol resolution works fine
sched.so is dlopened with RTLD_LOCAL|RTLD_NOW|RTLD_DEEPBIND
we can't change *that* dlopen to RTLD_GLOBAL because symbols in the modules loaded by the flux-broker process are the same

It isn't used by the main program, only linked to librdl, which itself is linked to sched.so

The relevant part of the LD_DEBUG output is:

     28282:     relocation processing: /home/ben/Work/DL-link/test/lib/libd.so
     28282:     symbol=_ITM_deregisterTMCloneTable;  lookup in file=./main [0]
     28282:     symbol=_ITM_deregisterTMCloneTable;  lookup in file=/lib64/libdl.so.2 [0]
     28282:     symbol=_ITM_deregisterTMCloneTable;  lookup in file=/lib64/libc.so.6 [0]
     28282:     symbol=_ITM_deregisterTMCloneTable;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     28282:     symbol=_ITM_deregisterTMCloneTable;  lookup in file=/home/ben/Work/DL-link/test/lib/libd.so [0]
     28282:     symbol=_ITM_deregisterTMCloneTable;  lookup in file=/lib64/libdl.so.2 [0]
     28282:     symbol=_ITM_deregisterTMCloneTable;  lookup in file=/lib64/libc.so.6 [0]
     28282:     symbol=_ITM_deregisterTMCloneTable;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     28282:     symbol=__gmon_start__;  lookup in file=./main [0]
     28282:     symbol=__gmon_start__;  lookup in file=/lib64/libdl.so.2 [0]
     28282:     symbol=__gmon_start__;  lookup in file=/lib64/libc.so.6 [0]
     28282:     symbol=__gmon_start__;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     28282:     symbol=__gmon_start__;  lookup in file=/home/ben/Work/DL-link/test/lib/libd.so [0]
     28282:     symbol=__gmon_start__;  lookup in file=/lib64/libdl.so.2 [0]
     28282:     symbol=__gmon_start__;  lookup in file=/lib64/libc.so.6 [0]
     28282:     symbol=__gmon_start__;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     28282:     symbol=_ITM_registerTMCloneTable;  lookup in file=./main [0]
     28282:     symbol=_ITM_registerTMCloneTable;  lookup in file=/lib64/libdl.so.2 [0]
     28282:     symbol=_ITM_registerTMCloneTable;  lookup in file=/lib64/libc.so.6 [0]
     28282:     symbol=_ITM_registerTMCloneTable;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     28282:     symbol=_ITM_registerTMCloneTable;  lookup in file=/home/ben/Work/DL-link/test/lib/libd.so [0]
     28282:     symbol=_ITM_registerTMCloneTable;  lookup in file=/lib64/libdl.so.2 [0]
     28282:     symbol=_ITM_registerTMCloneTable;  lookup in file=/lib64/libc.so.6 [0]
     28282:     symbol=_ITM_registerTMCloneTable;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     28282:     symbol=__cxa_finalize;  lookup in file=./main [0]
     28282:     symbol=__cxa_finalize;  lookup in file=/lib64/libdl.so.2 [0]
     28282:     symbol=__cxa_finalize;  lookup in file=/lib64/libc.so.6 [0]
     28282:     binding file /home/ben/Work/DL-link/test/lib/libd.so [0] to /lib64/libc.so.6 [0]: normal symbol `__cxa_finalize' [GLIBC_2.2.5]
     28282:     symbol=libe_func;  lookup in file=./main [0]
     28282:     symbol=libe_func;  lookup in file=/lib64/libdl.so.2 [0]
     28282:     symbol=libe_func;  lookup in file=/lib64/libc.so.6 [0]
     28282:     symbol=libe_func;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     28282:     symbol=libe_func;  lookup in file=/home/ben/Work/DL-link/test/lib/libd.so [0]
     28282:     symbol=libe_func;  lookup in file=/lib64/libdl.so.2 [0]
     28282:     symbol=libe_func;  lookup in file=/lib64/libc.so.6 [0]
     28282:     symbol=libe_func;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     28282:     /home/ben/Work/DL-link/test/lib/libd.so: error: symbol lookup error: undefined symbol: libe_func (fatal)
     28282:     
     28282:     file=/home/ben/Work/DL-link/test/lib/libd.so [0];  destroying link map

From that you can see that it never searches the local namespace from which libe_func() is called. That call is made from libd.so's start_libd() function.

Comment 4 Carlos O'Donell 2018-08-24 20:45:35 UTC
(In reply to Ben Woodard from comment #0)
> However, if one of those required libraries dlopen's a library then the
> local linkmap from which this library is searching is not searched.

This is behaving exactly as expected.

When libe dlopen's libd with RTLD_LOCAL, then libd has its own lookup scope that does *not* include libe. These are the semantics of RTLD_LOCAL.

If you do not want to link libd against libe, then you must load libe RTLD_GLOBAL. You don't explain why this is not an option. I expect that you don't want to pollute the global lookup scope with the lua symbols.

A future alternative here will be dlmopen, since you could open a new namespace and then load lua in that namespace with RTLD_GLOBAL, and still avoid the pollution of the normal base namespace. I'm reviewing Collabora's patches for dlmopen upstream, so it looks like glibc 2.29 might have some interesting support for this.
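
A minimal sketch of the namespace half of that idea, assuming glibc's existing dlmopen() with LM_ID_NEWLM. The helper name is hypothetical, and the sketch deliberately uses RTLD_LOCAL: making symbols global inside the new namespace is the part that depends on the upstream work mentioned above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper: load `path` into a freshly created link-map
 * namespace, separate from the base namespace of the program.
 * Returns the handle, or NULL after printing the loader's error.
 */
void *open_in_new_namespace(const char *path)
{
    assert(path != NULL);

    void *handle = dlmopen(LM_ID_NEWLM, path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlmopen %s: %s\n", path,
                err != NULL ? err : "unknown error");
    }
    return handle;
}
```

Anything loaded this way gets its own copies of its dependencies, including libc, so state such as stdio buffers is not shared with the base namespace.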

The case of the PAM modules is more interesting, but still the same case. If librdl.so is going to use LUA and it expects to be loaded with RTLD_LOCAL, then it must *reload* lua with RTLD_GLOBAL, and this is called a "promotion" in which case ld.so should promote LUA to RTLD_GLOBAL binding.
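
A minimal sketch of such a promotion, under stated assumptions. The helper name is hypothetical, and it adds glibc's RTLD_NOLOAD so that the call only promotes an object that is already loaded; drop that flag to get the plain reload described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper: promote an already loaded library to the global
 * scope. Returns 0 on success, -1 if the library is not loaded or the
 * promotion fails.
 */
int promote_to_global(const char *soname)
{
    assert(soname != NULL);

    /* RTLD_NOLOAD: never load the file, only re-open it if resident. */
    void *handle = dlopen(soname, RTLD_NOW | RTLD_NOLOAD | RTLD_GLOBAL);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "promote %s: %s\n", soname,
                err != NULL ? err : "not loaded");
        return -1;
    }

    /* The handle is intentionally kept open so the library stays mapped. */
    return 0;
}
```

A library in the position of librdl.so could call this on the Lua shared object before running any script that may `require` a C module. The cost is the one discussed above: the Lua symbols become visible to the whole process.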

I don't see any problem here. We have global scopes, and we have local scopes. You have to look at how they interact and use them to solve your scoping problems.

Here you want to isolate lua with a local scope, but at the same time the lua community wants to use global scope binding to avoid lua modules depending directly on the lua DSO. So this conflicts with developer usage.

I believe another solution might be to implement RTLD_PARENT and RTLD_GROUP from Solaris to have better control over binding.

With RTLD_PARENT the caller of dlopen has its symbols made available to the loaded scope. So lua would make its symbols available to plugins, but no deeper.

With RTLD_GROUP you can make a closed set of symbol deps.

I'll leave this open for a while in case you want to discuss, but it will be closed as NOTABUG, or you can change it to an RFE for RTLD_PARENT.

Comment 5 Carlos O'Donell 2018-08-28 19:00:11 UTC
I'm retitling this to indicate a desire to have RTLD_PARENT, which would allow LUA's language to load other DSOs and share its own symbols with them for relocation, but not for subsequent dlsym/dlvsym access. If that's not going to help your particular use case with LUA, then please provide a complete example for the use case you're trying to support.

Comment 10 Carlos O'Donell 2020-01-20 14:45:23 UTC
Given the complexity of implementing RTLD_PARENT, this must be tracked upstream and fixed there first. Once those semantics are fixed upstream, they will be included in RHEL.

I've filed the following upstream bug for the glibc team to use:
https://sourceware.org/bugzilla/show_bug.cgi?id=25421

I'm marking this bug CLOSED/UPSTREAM. We are going to track this upstream.