Bug 1300049

Summary: dlerror () returns NULL after dlsym (RTLD_NEXT) of a non-existent symbol
Product: Red Hat Enterprise Linux 7 Reporter: Joe Wright <jwright>
Component: glibc Assignee: Florian Weimer <fweimer>
Status: CLOSED WONTFIX QA Contact: qe-baseos-tools-bugs
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.3 CC: ashankar, cww, fweimer, mnewsome, pfrankli
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1333945 (view as bug list) Environment:
Last Closed: 2016-05-12 14:14:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1333945    
Bug Blocks: 1203710    

Description Joe Wright 2016-01-19 20:24:48 UTC
Description of problem:
- Shouldn't dlerror() return something meaningful instead of 0 after a dlsym() call for a non-existent symbol? The current behavior also contradicts the man page.

Version-Release number of selected component (if applicable):


How reproducible:
- consistently

Steps to Reproduce:

This prints all zeros:
std::cout << (void*)dlerror() << std::endl;
std::cout << dlsym(RTLD_NEXT, "does_not_exist") << std::endl;
std::cout << (void*)dlerror() << std::endl;
std::cout << dlvsym(RTLD_NEXT, "pthread_cond_timedwait", "DOES_NOT_EXIST") << std::endl;
std::cout << (void*)dlerror() << std::endl;

/// a.C
#include <iostream>
#include <dlfcn.h>
int main()
{
  std::cout << (void*)dlerror() << std::endl;
  std::cout << dlsym(RTLD_NEXT, "does_not_exist") << std::endl;
  std::cout << (void*)dlerror() << std::endl;
  std::cout << dlvsym(RTLD_NEXT, "pthread_cond_timedwait", "DOES_NOT_EXIST") << std::endl;
  std::cout << (void*)dlerror() << std::endl;
}

Run commands:
  g++ a.C -ldl 
  ./a.out

$ ldd -r ./a.out
        linux-vdso.so.1 =>  (0x00007ffccbebe000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003fa5400000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003fab400000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003fa5800000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003fa8c00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003fa4c00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003fa4800000)
[alanm@vmw102 C01555171]$ ./a.out
0
0
0
0
0
LD_DEBUG=symbols,bindings ./a.out


     57410:     symbol=_res;  lookup in file=./a.out [0]
     57410:     symbol=_res;  lookup in file=/lib64/libdl.so.2 [0]
     57410:     symbol=_res;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
     57410:     symbol=_res;  lookup in file=/lib64/libm.so.6 [0]
     57410:     symbol=_res;  lookup in file=/lib64/libgcc_s.so.1 [0]
     57410:     symbol=_res;  lookup in file=/lib64/libc.so.6 [0]
     57410:     binding file /lib64/libc.so.6 [0] to /lib64/libc.so.6 [0]: normal symbol `_res' [GLIBC_2.2.5]
     57410:     symbol=_IO_file_close;  lookup in file=./a.out [0]
     57410:     symbol=_IO_file_close;  lookup in file=/lib64/libdl.so.2 [0]
     57410:     symbol=_IO_file_close;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
     57410:     symbol=_IO_file_close;  lookup in file=/lib64/libm.so.6 [0]
     57410:     symbol=_IO_file_close;  lookup in file=/lib64/libgcc_s.so.1 [0]
     57410:     symbol=_IO_file_close;  lookup in file=/lib64/libc.so.6 [0]
     57410:     binding file /lib64/libc.so.6 [0] to /lib64/libc.so.6 [0]: normal symbol `_IO_file_close' [GLIBC_2.2.5]
     57410:     symbol=rpc_createerr;  lookup in file=./a.out [0]
     57410:     symbol=rpc_createerr;  lookup in file=/lib64/libdl.so.2 [0]
     57410:     symbol=rpc_createerr;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
     57410:     symbol=rpc_createerr;  lookup in file=/lib64/libm.so.6 [0]
     57410:     symbol=rpc_createerr;  lookup in file=/lib64/libgcc_s.so.1 [0]
     57410:     symbol=rpc_createerr;  lookup in file=/lib64/libc.so.6 [0]

.....




Where are you experiencing the behavior?  What environment?

Shouldn't dlerror() return something meaningful instead of 0 after a dlsym() call for a non-existent symbol? The current behavior also contradicts the man page.

Actual results:
- returns 0

Expected results:
- The POSIX spec says:
If handle does not refer to a valid symbol table handle or if the symbol named by name cannot be found in the symbol table associated with handle, dlsym() shall return a null pointer.

Additional info:

I'm not sure about the expected behavior, but I do see RTLD_DEFAULT behaving as expected:


RTLD_NEXT

  [...]
     11751:     binding file /lib64/libdl.so.2 [0] to /lib64/libc.so.6 [0]: normal symbol `_dl_sym' [GLIBC_PRIVATE]
     11751:     symbol=does_not_exist;  lookup in file=/lib64/libdl.so.2 [0]
     11751:     symbol=does_not_exist;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
     11751:     symbol=does_not_exist;  lookup in file=/lib64/libm.so.6 [0]
     11751:     symbol=does_not_exist;  lookup in file=/lib64/libgcc_s.so.1 [0]
     11751:     symbol=does_not_exist;  lookup in file=/lib64/libc.so.6 [0]
     11751:     symbol=does_not_exist;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
0
0


RTLD_DEFAULT

     11790:     binding file /lib64/libdl.so.2 [0] to /lib64/libc.so.6 [0]: normal symbol `_dl_sym' [GLIBC_PRIVATE]
     11790:     symbol=does_not_exist;  lookup in file=./a.out [0]
     11790:     symbol=does_not_exist;  lookup in file=/lib64/libdl.so.2 [0]
     11790:     symbol=does_not_exist;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
     11790:     symbol=does_not_exist;  lookup in file=/lib64/libm.so.6 [0]
     11790:     symbol=does_not_exist;  lookup in file=/lib64/libgcc_s.so.1 [0]
     11790:     symbol=does_not_exist;  lookup in file=/lib64/libc.so.6 [0]
     11790:     symbol=does_not_exist;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     11790:     ./a.out: error: symbol lookup error: undefined symbol: does_not_exist (fatal)
0
   [...]
     11793:     symbol=free;  lookup in file=/lib64/libc.so.6 [0]
     11793:     binding file /lib64/libdl.so.2 [0] to /lib64/libc.so.6 [0]: normal symbol `free' [GLIBC_2.2.5]
0x23b80c0


Didn't find anything on quick googling 

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=430732

Even in the latest version it's still noted that they
> .. are reserved for future use as special values that applications may be allowed to use for handle.
http://pubs.opengroup.org/onlinepubs/9699919799/


The POSIX spec says:
If handle does not refer to a valid symbol table handle or if the symbol named by name cannot be found in the symbol table associated with handle, dlsym() shall return a null pointer.

More detailed diagnostic information shall be available through dlerror().
and the dlerror() page says:
If no dynamic linking errors have occurred since the last invocation of dlerror(), dlerror() shall return NULL.
If successful, dlerror() shall return a null-terminated character string; otherwise, NULL shall be returned.

In this specific case, the symbol does not exist, so it returns 0, which is good. But why dlerror() also returns NULL is a bit confusing.
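
A minimal sketch of the call sequence the man page describes, useful for comparing handles: clear the error state, call dlsym(), then read dlerror(). The LookupResult type and the checked_dlsym helper are hypothetical names introduced for this illustration; only dlsym() and dlerror() come from <dlfcn.h>.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT and RTLD_DEFAULT are GNU extensions in <dlfcn.h>
#endif
#include <dlfcn.h>

#include <cassert>
#include <string>

// Hypothetical result type for this sketch; not part of any library.
struct LookupResult {
  void *address;      // value returned by dlsym()
  std::string error;  // text from dlerror(), empty if dlerror() returned NULL
};

// Hypothetical helper: look up `name` and capture both the address and the
// error text, so the two can be inspected together.
LookupResult checked_dlsym(void *handle, const char *name) {
  assert(name != nullptr);
  dlerror();                             // discard any stale error state
  void *address = dlsym(handle, name);   // may legitimately return NULL
  const char *message = dlerror();       // NULL means no error was recorded
  return LookupResult{address, message ? message : ""};
}
```

Called as checked_dlsym(RTLD_DEFAULT, "does_not_exist"), the helper returns a null address together with an error string. With RTLD_NEXT on the glibc version reported here, the address is null and the error string is empty, which is the behavior in question. Build with -ldl as in the run commands above.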

Comment 3 Florian Weimer 2016-01-19 21:18:12 UTC
Looking at _dl_lookup_symbol_x, this may indeed be a bug due to the way RTLD_NEXT is implemented: it continues after lookup errors, but this way, it never signals the error.

Comment 4 Florian Weimer 2016-01-19 21:55:44 UTC
On the other hand, the error is deliberately masked here for the RTLD_NEXT case (where skip_map != NULL):

  if (__glibc_unlikely (current_value.s == NULL))
    {
      if ((*ref == NULL || ELFW(ST_BIND) ((*ref)->st_info) != STB_WEAK)
          && skip_map == NULL
          && !(GLRO(dl_debug_mask) & DL_DEBUG_UNUSED))
        {
          /* We could find no value for a strong reference.  */
          const char *reference_name = undef_map ? undef_map->l_name : "";
          const char *versionstr = version ? ", version " : "";
          const char *versionname = (version && version->name
                                     ? version->name : "");

          /* XXX We cannot translate the message.  */
          _dl_signal_cerror (0, DSO_FILENAME (reference_name),
                             N_("symbol lookup error"),
                             make_string ("undefined symbol: ", undef_name,
                                          versionstr, versionname));
        }
      *ref = NULL;
      return 0;
    }

This was carried over from _dl_lookup_symbol_skip when the separate function was removed in upstream commit bdf4a4f1eabb2e085b0610b53bb37b5263f4728d.  The original implementation of _dl_lookup_symbol_skip in commit 84384f5b6aaa622236ada8c9a7ff51f40b91fc20 did not have error reporting, either.  Why this is so is unclear to me.

Solaris documentation implies that the dlerror return value changes if dlsym with RTLD_NEXT is unsuccessful.  Therefore, I think we should change glibc behavior.
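
The condition quoted above can be restated as a small predicate, which makes it easier to see which failed lookups report an error. This is a simplified model for illustration only, not glibc code: LookupContext and signals_lookup_error are hypothetical names, and each field stands in for one term of the original if statement.

```cpp
#include <cassert>

// Hypothetical model of the inputs to the quoted condition.
struct LookupContext {
  bool has_reference;      // models *ref != NULL
  bool reference_is_weak;  // models ELFW(ST_BIND) (...) == STB_WEAK
  bool has_skip_map;       // models skip_map != NULL
  bool debug_unused;       // models GLRO(dl_debug_mask) & DL_DEBUG_UNUSED
};

// Returns true when the modeled code path would call _dl_signal_cerror
// after a failed lookup (current_value.s == NULL).
bool signals_lookup_error(const LookupContext &ctx) {
  // "Strong reference": either no reference symbol at all, or one that is not weak.
  const bool strong_reference = !ctx.has_reference || !ctx.reference_is_weak;
  return strong_reference && !ctx.has_skip_map && !ctx.debug_unused;
}
```

Under this model, a failed lookup for a strong reference reports an error only when no skip map is supplied; any call that passes a skip map returns 0 silently, leaving nothing for dlerror() to report.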

Comment 8 Carlos O'Donell 2016-01-20 04:32:54 UTC
The return of NULL from dlsym or dlvsym is sufficient to indicate the symbol was not found.

Yet, there are two more cases of interest that I can see:

(1) Return alternate errors other than "not found"

This is one of the few good reasons to want this fixed. If the functions run into a serious internal error, it can be reported via dlerror.

(2) Support NULL symbols.

One might argue that this doesn't support distinguishing between a true "null" symbol, a symbol whose address is 0x0, versus a not-found symbol, and that's true. 

At present, such a symbol can only, as far as I know, be constructed artificially via a linker script (as a NOTYPE symbol via PROVIDE e.g. PROVIDE(null_symbol = 0x0);) or via special section directives and assembly.

Relocations against such symbols will fail today (abort ld.so) because the dynamic loader cannot handle such true "null" symbols.

e.g.
     11127:	symbol=null_symbol;  lookup in file=./test [0]
     11127:	symbol=null_symbol;  lookup in file=./libinterposer.so [0]
     11127:	symbol=null_symbol;  lookup in file=/lib64/libdl.so.2 [0]
     11127:	symbol=null_symbol;  lookup in file=/lib64/libc.so.6 [0]
     11127:	symbol=null_symbol;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     11127:	./libinterposer.so: error: symbol lookup error: undefined symbol: null_symbol (fatal)
./test: symbol lookup error: ./libinterposer.so: undefined symbol: null_symbol

readelf -a -W libinterposer.so | grep null
0000000000600fd8  0000000d00000006 R_X86_64_GLOB_DAT      0000000000000000 null_symbol + 0
    13: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  ABS null_symbol
    50: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  ABS null_symbol

Even if we fix the dynamic loader, the result returned from dlsym will be non-null because it will have the load offset added. Therefore the only way to get a true "null" symbol is to enable low addresses, and map the DSO at address zero for the symbol to exist. I see no useful reason to do this in a sensible application. Therefore if one sees a null return from dlsym et al. then it means the symbol was not found, and barring (1), it really means "symbol not found".
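
The load-offset argument can be illustrated with plain arithmetic. This sketch assumes the loader computes a symbol's runtime address as load base plus symbol value, as described above; runtime_address is a hypothetical helper, and the base address is borrowed from the ldd output in the description purely as an example value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: runtime address = DSO load base + symbol value (st_value).
constexpr std::uint64_t runtime_address(std::uint64_t load_base,
                                        std::uint64_t st_value) {
  return load_base + st_value;
}

// Example base taken from the ldd output for libdl.so.2 in the description;
// any non-zero base behaves the same way.
constexpr std::uint64_t kExampleLoadBase = 0x0000003fa5400000ULL;

// A symbol whose value is 0 still ends up at a non-zero runtime address.
static_assert(runtime_address(kExampleLoadBase, 0) == kExampleLoadBase,
              "the load base is added to the symbol value");

// Only a DSO mapped at address zero would yield a true null symbol.
static_assert(runtime_address(0, 0) == 0,
              "a zero base and zero value give a null address");
```

This is why a null return from dlsym can be read as "symbol not found" in practice: a defined symbol would have to live in an object mapped at address zero to produce the same value.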

Comment 9 Carlos O'Donell 2016-02-09 04:01:20 UTC
It is most likely a QoI issue that we should fix; returning 'not found' from dlerror() would be the highest-quality implementation. However, this has been the case forever on Linux, and I expect the man page text is a holdover from Solaris or AIX where it might have been possible to get a valid NULL symbol. It's almost never the case that you'll have a valid NULL symbol in Linux (at least not easily), but rather than change the man page we should adjust the dlsym and dlvsym code to improve the implementation.

This has to go through upstream, and it will change the semantic behaviour of dlsym and dlvsym, which might impact some applications. It also needs testing with dlmopen to cover all the code paths.

This is not going to fit into a rhel-7.3 timeframe, so this will have to be rhel-7.4 or later.

Comment 10 Florian Weimer 2016-02-09 14:32:23 UTC
Patch posted upstream for review:

  https://sourceware.org/ml/libc-alpha/2016-02/msg00172.html

Comment 12 Florian Weimer 2016-05-10 11:52:35 UTC
This bug fix has the potential to break Address Sanitizer:

  https://llvm.org/bugs/show_bug.cgi?id=27310

I think it's not really defined what ASAN is doing (you need to have a working malloc when you call dlsym), but the question is if this kind of breakage is worth fixing this bug.

Comment 13 Florian Weimer 2016-05-10 11:55:25 UTC
Typical error message:

==10293==AddressSanitizer CHECK failed: ../../../../libsanitizer/asan/asan_rtl.cc:556 "((!asan_init_is_running && "ASan init calls itself!")) != (0)" (0x0, 0x0)
    <empty stack>

Comment 14 Florian Weimer 2016-05-12 14:14:22 UTC
Unfortunately, we cannot address this issue in Red Hat Enterprise Linux 7 because Address Sanitizer (ASAN) depends on dlsym (RTLD_NEXT) not providing an error message (see comment 12).  This affects both the Address Sanitizer version in GCC, and the version in LLVM/Clang.

There is also at least one more application which is confused by the more accurate dlerror reporting (fakeroot, often used for building software packages).

This means that the risk of introducing regressions is just too high to implement this change in Red Hat Enterprise Linux 7.

We have already addressed this issue in upstream glibc, so future versions of Red Hat Enterprise Linux will very likely include the fix.