Bug 1409899 - glibc: [RFE] Deadlock with dlclose call
Summary: glibc: [RFE] Deadlock with dlclose call
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: glibc
Version: 8.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.2
Assignee: glibc team
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1420851 1477664
 
Reported: 2017-01-03 19:35 UTC by Paulo Andrade
Modified: 2020-11-14 06:37 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-02 15:44:25 UTC
Type: Bug
Target Upstream Version:


Attachments
dllock_bug.tar.gz (2.63 KB, application/x-gzip)
2017-01-03 19:35 UTC, Paulo Andrade


Links
System      ID     Private  Priority  Status    Summary                                                      Last Updated
Sourceware  15686  0        P2        NEW       Shared-object static constructors called with a lock held    2020-07-13 13:17:35 UTC
Sourceware  19448  0        P2        RESOLVED  deadlock in dlopen when ctor calls dlopen in another thread  2020-07-13 13:17:36 UTC

Description Paulo Andrade 2017-01-03 19:35:03 UTC
Created attachment 1236965 [details]
dllock_bug.tar.gz

The reproducer is based on the attachment to
https://sourceware.org/bugzilla/show_bug.cgi?id=2377

  The program runs like this:

1.  A shared object is dlopen'ed.
2.  A function from the shared object is run.
3.  This function creates a thread to run a "service".
4.  The shared object is dlclose'd.
5.  The shared object has a "fini" method.
6.  The "fini" method calls pthread_join.
7.  The service receives a C++ exception.
8.  The service handles the exception and exits.

  The problem is that it blocks at step 7.
  This happens because dlclose runs "destructors" with dl_load_lock
acquired, and the C++ exception calls tls_get_addr_tail(), which
deadlocks attempting to acquire dl_load_lock.
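
  The shared-object side of the steps above might look roughly like
the sketch below. This is not the attached reproducer: every name in
it (start_service, join_service, service_fini,
service_exception_handled) is made up for illustration, and it does
not try to control the timing, so the exception may well be handled
before the destructor runs.

```cpp
// service.cpp: hypothetical stand-in for the shared object described above.
// All function names here are illustrative, not taken from the attachment.
#include <pthread.h>
#include <stdexcept>

static pthread_t service_thread;
static bool service_started = false;
static bool exception_handled = false;

// Steps 7-8: the service receives a C++ exception, handles it, and exits.
static void *service_main(void *) {
    try {
        throw std::runtime_error("stop service");
    } catch (const std::runtime_error &) {
        exception_handled = true;
    }
    return nullptr;
}

// Steps 2-3: called by the host program after dlopen; starts the service.
extern "C" int start_service() {
    if (pthread_create(&service_thread, nullptr, service_main, nullptr) != 0)
        return -1;
    service_started = true;
    return 0;
}

// Waits for the service thread; safe to call more than once.
extern "C" int join_service() {
    if (!service_started)
        return 0;
    service_started = false;
    return pthread_join(service_thread, nullptr);
}

// Steps 5-6: the "fini" method. Under dlclose this runs while the
// dynamic linker still holds its internal lock.
__attribute__((destructor)) static void service_fini() {
    join_service();
}

// Lets a caller check that the exception handler ran.
extern "C" bool service_exception_handled() {
    return exception_handled;
}
```

  Linked directly into a program and called by hand, this runs to
completion. The hang described here needs the destructor to run from
dlclose while the service thread is still throwing.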

  This should be a variant of
https://sourceware.org/git/?p=glibc.git;a=commit;h=e400f3ccd36fe91d432cc7d45b4ccc799dece763
A temporary "workaround" could be to "fix" it in libstdc++, but the
issue would still happen if a static TLS variable is first accessed
in the "destructor", after dlclose() is called. A proper correction
would be to add a second mutex for use by tls_get_addr_tail() and
others; dlclose() would then also acquire this mutex after running
the destructors.

Comment 3 Florian Weimer 2017-06-12 16:09:44 UTC
It is my understanding that this issue arises from a design issue in the dynamic linker.  The dynamic linker invokes ELF constructors and destructors (essentially callback functions) while internal locks are acquired.  To enable dynamic linker usage from these callback functions without self-deadlocking, those locks are recursive locks.  However, recursive locks do not help if the callback spawns another thread to do the work, or hands off the work to another thread and waits until that thread signals the work is complete.

This code needs to be rewritten upstream, so that no locks are acquired while the dynamic linker calls into user code.  To my knowledge, no such patches exist yet, and this has not worked in any glibc release.
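
To illustrate the point about recursive locks, here is a minimal sketch using a plain pthread recursive mutex as a stand-in for the dynamic linker's internal lock (it is not glibc's actual lock, and the names demo_lock, LockProbe and probe_recursive_lock are invented for this example). It uses pthread_mutex_trylock so that it reports the would-block case instead of hanging.

```cpp
#include <pthread.h>
#include <cerrno>

// Stand-in for the dynamic linker's internal lock (illustrative only).
static pthread_mutex_t demo_lock;

// Runs on a second thread: tries to take the lock held by the first thread.
static void *try_from_other_thread(void *result) {
    int rc = pthread_mutex_trylock(&demo_lock);
    if (rc == 0)
        pthread_mutex_unlock(&demo_lock);
    *static_cast<int *>(result) = rc;
    return nullptr;
}

struct LockProbe {
    int same_thread;   // trylock result when the owner re-locks
    int other_thread;  // trylock result from a different thread
};

LockProbe probe_recursive_lock() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&demo_lock, &attr);
    pthread_mutexattr_destroy(&attr);

    LockProbe probe = {-1, -1};

    // The "dynamic linker" takes the lock before calling user code.
    pthread_mutex_lock(&demo_lock);

    // A callback on the same thread can re-acquire a recursive lock.
    probe.same_thread = pthread_mutex_trylock(&demo_lock);
    if (probe.same_thread == 0)
        pthread_mutex_unlock(&demo_lock);

    // A thread spawned by the callback cannot: the lock is owned elsewhere.
    pthread_t helper;
    if (pthread_create(&helper, nullptr, try_from_other_thread,
                       &probe.other_thread) == 0)
        pthread_join(helper, nullptr);

    pthread_mutex_unlock(&demo_lock);
    pthread_mutex_destroy(&demo_lock);
    return probe;
}
```

With a blocking pthread_mutex_lock in the helper thread and the owner waiting in pthread_join, neither thread could make progress, which is the shape of the deadlock in this bug.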

Comment 4 Carlos O'Donell 2017-06-13 19:46:12 UTC
(In reply to Florian Weimer from comment #3)
> It is my understanding that this issue arises from a design issue in the
> dynamic linker.  The dynamic linker invokes ELF constructors and destructors
> (essentially callback functions) while internal locks are acquired.  To
> enable dynamic linker usage from these callback functions without
> self-deadlocking, those locks are recursive locks.  However, recursive locks
> do not help if the callback spawns another thread to do the work, or hands
> off the work to another thread and waits until that thread signals the work
> is complete.

Exactly right.

We have had several instances of this problem in the last year where applications are doing too much in constructors or destructors, and it usually involves starting, controlling, and stopping threads. This quickly leads to deadlocks.
 
> This code needs to be rewritten upstream, so that no locks are acquired
> while the dynamic linker calls into user code.  To my knowledge, no such
> patches exist yet, and this has not worked in any glibc release.

It is indeed a problem to call foreign functions (constructors and destructors) with locks held, and it is a long-term goal to simplify this in the dynamic loader to attempt to make it possible while still maintaining the consistency of the loaded libraries.

I have also never seen any patches to fix this upstream. The closest was a discussion I had with Mathieu Desnoyers (lttng, liburcu) at LPC 2016 where I discussed the use of liburcu and RCU in general to break the internal dynamic loader load lock.

As Florian points out, this has never worked, and if anything this is a new feature request.

Comment 5 Chris Williams 2017-06-22 18:29:33 UTC
Changing this to an RFE based on comment 4.

Comment 9 Carlos O'Donell 2020-03-02 15:44:25 UTC
Given that this bug requires extensive work upstream to resolve, I'm going to mark this as CLOSED/UPSTREAM.

We are going to review and track the bug upstream here:
https://sourceware.org/bugzilla/show_bug.cgi?id=15686

The solution upstream is that we have to be able to release the locks while constructors run, but this is quite hard to implement while keeping the expected semantics.
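
As a rough illustration of the general "do not call user code with the lock held" pattern, the toy registry below snapshots its callbacks under a mutex and invokes them only after unlocking. This is only a sketch with invented names (register_callback, run_callbacks_unlocked, and so on); it is not glibc code, and it ignores the hard part mentioned above, namely keeping the set of loaded libraries consistent while the lock is released.

```cpp
#include <pthread.h>
#include <cstddef>

typedef void (*callback_fn)();

// Toy registry protected by a non-recursive mutex (illustrative only).
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static callback_fn registry[8];
static std::size_t registry_count = 0;
static int lock_result_seen_by_helper = -1;

// Helper thread: records whether it could take the registry lock.
static void *helper_main(void *) {
    int rc = pthread_mutex_trylock(&registry_lock);
    if (rc == 0)
        pthread_mutex_unlock(&registry_lock);
    lock_result_seen_by_helper = rc;
    return nullptr;
}

// A callback that hands work to another thread and waits for it,
// like the "fini" method in this bug.
void callback_using_helper_thread() {
    pthread_t helper;
    if (pthread_create(&helper, nullptr, helper_main, nullptr) == 0)
        pthread_join(helper, nullptr);
}

bool register_callback(callback_fn fn) {
    pthread_mutex_lock(&registry_lock);
    bool ok = registry_count < 8;
    if (ok)
        registry[registry_count++] = fn;
    pthread_mutex_unlock(&registry_lock);
    return ok;
}

// Copies the callbacks while holding the lock, then runs them unlocked.
// Returns the number of callbacks that were run.
std::size_t run_callbacks_unlocked() {
    callback_fn snapshot[8];
    pthread_mutex_lock(&registry_lock);
    std::size_t n = registry_count;
    for (std::size_t i = 0; i < n; ++i)
        snapshot[i] = registry[i];
    registry_count = 0;
    pthread_mutex_unlock(&registry_lock);

    for (std::size_t i = 0; i < n; ++i)
        snapshot[i]();  // user code runs with no registry lock held
    return n;
}

int helper_lock_result() {
    return lock_result_seen_by_helper;
}
```

Because the lock is dropped before the callback runs, the helper thread can take it and the callback's pthread_join returns.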

