Bug 2189923

Summary: glibc: Lazy binding during dlclose may fail unnecessarily
Product: Red Hat Enterprise Linux 9 Reporter: Paulo Andrade <pandrade>
Component: glibcAssignee: Carlos O'Donell <codonell>
Status: VERIFIED --- QA Contact: Sergey Kolosov <skolosov>
Severity: medium Docs Contact:
Priority: medium    
Version: 9.3CC: alanm, amike, ashankar, brclark, casantos, codonell, dj, dmalcolm, fweimer, jakub, jwright, mkielian, mkolbas, ohudlick, pandrade, pfrankli, sipoyare, skolosov
Target Milestone: rcKeywords: Bugfix, Triaged
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: glibc-2.34-71.el9 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
symbol-resolution-segfault-master.zip none

Description Paulo Andrade 2023-04-26 12:58:53 UTC
Created attachment 1960126 [details]
symbol-resolution-segfault-master.zip

Customer describe his findings in the README.md file.

Basically, customer has a dlopened shared library that defines a global
C++ object in a "child" library.

One of the source code files is compiled without optimization, that creates
a destructor function.

The other is compiled with optimizations, what inlines the destructor.
When unloading the libraries.

Basically there is a incorrect dependency on destructors, or wrong calls.

We suggested he user to make sure to build everything with the same
compiler options, but they said it is not under their control. This is
because the actual crashes, not the sample, is caused by STL and they do
not have control on those. Due to this, customer said they are needing
to compile disabling inlining of everything.

For the sample reproducer one possible workaround would be to add
__attribute__((noinline)) to Member::~Member(), but as customer said,
they do not have control of the STL source code.

Do you have some suggestion of possible workarounds, or better explain
the issue to the customer?

Comment 3 Florian Weimer 2023-04-26 17:03:40 UTC
This looks like this needs an upstream discussion. Would the customer be willing to raise this on libc-alpha, along with their reproducer? This is great stuff, it would be a loss to hide it in a private bug somewhere.

It's possible to avoid the problematic interposition with -Wl,-Bsymbolic, not just with -Wl,-z,now. (-flto -fno-semantic-interposition should work as well, but I haven't tested it yet.)

There are a bunch of oddities here. The l_removed check in elf/dl-lookup.c was effectively introduced to fix this upstream bug: https://sourceware.org/bugzilla/show_bug.cgi?id=821 But it does not contain much information why this was needed. There should not be any harm to bind about-to-be unloaded modules to about-to-be-unloaded modules in principle, but there might be some subtle corner-cases involving recursive dlclose calls.

It's unfortunate that we flag the definition of _ZN6MemberD1Ev as weak in libparent.so:

$ eu-readelf --symbols=.dynsym libparent.so | grep _ZN6MemberD1Ev
   12: 0000000000001186     21 FUNC    WEAK   DEFAULT       12 _ZN6MemberD1Ev

This comes from the weak symbols in parent_pic.o, which are used to implement C++ vague linkage. If we had a non-weak definition here, at least we'd get a proper ld.so error message, instead of the crash due to a jump to address zero.

We should be clear that given the product life-cycle, this issue will not be addressed in Red Hat Enterprise Linux 7. We need to see what we can do about later releases, but even 8 will be tough, especially if the changes required to support this scenario turn out to be invasive.

Comment 4 Paulo Andrade 2023-04-26 17:24:49 UTC
Many thanks again for the feedback Florian!

Customer allowed making the reproducer public.

I preferred to make the bug report private at first due to uncertainty
of possible workarounds, if any, or if the response would just be to
not mix compiler options.

Indeed the 3 suggested workarounds correct the issue at least for the
sample reproducer.

I will let the customer know about the possible workarounds and suggestion
to raise it on glibc-alpha.

Comment 5 Florian Weimer 2023-05-05 20:20:35 UTC
I'm treating this mostly as a glibc bug. The weak symbol is a GCC issue, but not the core problem.

The patch below (against upstream) fixes the reproducer.  Admittedly it's a bit silly.

diff --git a/elf/dl-lookup.c b/elf/dl-lookup.c
index 05f36a2507..ecc1896c9d 100644
--- a/elf/dl-lookup.c
+++ b/elf/dl-lookup.c
@@ -366,8 +366,10 @@ do_lookup_x (const char *undef_name, unsigned int new_hash,
       if ((type_class & ELF_RTYPE_CLASS_COPY) && map->l_type == lt_executable)
 	continue;
 
-      /* Do not look into objects which are going to be removed.  */
-      if (map->l_removed)
+      /* Do not look into objects which are going to be removed.
+	 References from a map that is under removal are allowed, to
+	 enable lazy binding during dlclose.  */
+      if (map->l_removed && !undef_map->l_removed)
 	continue;
 
       /* Print some debugging info if wanted.  */

I think it's quite low-risk. If we are worried about undef_map being NULL for some reason, we could a check for that after backporting.