Created attachment 1236965
dllock_bug.tar.gz
The reproducer is based on the attachment to
https://sourceware.org/bugzilla/show_bug.cgi?id=2377
The program runs like this:
1. A shared object is dlopen'ed.
2. A function from the shared object is run.
3. This function creates a thread to run a "service".
4. The shared object is dlclose'd.
5. The shared object has a "fini" method.
6. The "fini" method calls pthread_join.
7. The service receives a C++ exception.
8. The service handles the exception and exits.
The problem is that it blocks at step 7.
This happens because dlclose runs "destructors" with dl_load_lock
acquired, and the C++ exception calls tls_get_addr_tail(), which
deadlocks attempting to acquire dl_load_lock.
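The blocking pattern can be modelled without the dynamic linker. The following is only a sketch under stated assumptions: `model_load_lock` is a hypothetical stand-in for dl_load_lock, the function names are invented for illustration, and the service thread gives up after a timeout so that the demonstration terminates instead of hanging as the real reproducer does.

```cpp
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for glibc's dl_load_lock; not the real lock.
static std::recursive_timed_mutex model_load_lock;

// Models step 7: the service thread needs the lock, as tls_get_addr_tail()
// does in the report. The real thread would block forever; this one gives
// up after a timeout so the model terminates.
static bool service_gets_lock(std::chrono::milliseconds timeout) {
    if (!model_load_lock.try_lock_for(timeout)) return false;
    model_load_lock.unlock();
    return true;
}

// Models dlclose(): hold the lock, then run a "fini" that joins the service.
// Returns true if the service obtained the lock while "fini" was waiting.
bool service_progresses_during_model_dlclose() {
    std::lock_guard<std::recursive_timed_mutex> held(model_load_lock);
    bool progressed = false;
    std::thread service([&progressed] {
        progressed = service_gets_lock(std::chrono::milliseconds(100));
    });
    service.join();  // the "fini" method waiting for the service thread
    return progressed;
}
```

In this model the service is started inside the locked region for brevity; in the reproducer it is already running when dlclose is called. The outcome is the same: the service cannot take the lock while the joining thread holds it.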
This should be a variant of
https://sourceware.org/git/?p=glibc.git;a=commit;h=e400f3ccd36fe91d432cc7d45b4ccc799dece763
Now a temporary "workaround" could be to "fix" it in libstdc++,
but the issue would still happen if one has a static TLS variable
that is first accessed in the "destructor", after dlclose()
is called. A proper correction would be to have a second
mutex for the use of tls_get_addr_tail() and others, which
dlclose() would also acquire after running the destructors.
It is my understanding that this issue arises from a design issue in the dynamic linker. The dynamic linker invokes ELF constructors and destructors (essentially callback functions) while internal locks are acquired. To enable dynamic linker usage from these callback functions without self-deadlocking, those locks are recursive locks. However, recursive locks do not help if the callback spawns another thread to do the work, or hands off the work to another thread and waits until that thread signals the work is complete.
This code needs to be rewritten upstream, so that no locks are acquired while the dynamic linker calls into user code. To my knowledge, no such patches exist yet, and this has not worked in any glibc release.
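The point about recursive locks can be illustrated with a small sketch. It assumes nothing about glibc internals: `demo_lock` is a hypothetical stand-in for a recursive loader lock, and both function names are invented for this example. It uses try_lock so that the failing case returns instead of blocking.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-in for a recursive loader lock; not glibc's real lock.
static std::recursive_mutex demo_lock;

// The owning thread re-enters the lock, much as a constructor that calls
// back into the dynamic linker would.
bool owner_can_reenter() {
    std::lock_guard<std::recursive_mutex> held(demo_lock);
    bool acquired = demo_lock.try_lock();
    if (acquired) demo_lock.unlock();
    return acquired;
}

// A helper thread spawned by the owner is a different thread, so the
// recursion does not apply to it while the owner still holds the lock.
bool helper_can_enter_while_owner_holds() {
    std::lock_guard<std::recursive_mutex> held(demo_lock);
    bool acquired = false;
    std::thread helper([&acquired] {
        acquired = demo_lock.try_lock();
        if (acquired) demo_lock.unlock();
    });
    helper.join();
    return acquired;
}
```

With a blocking lock() in place of try_lock(), the helper would wait for the owner while the owner waits in join() for the helper, which is the deadlock described in this bug.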
(In reply to Florian Weimer from comment #3)
> It is my understanding that this issue arises from a design issue in the
> dynamic linker. The dynamic linker invokes ELF constructors and destructors
> (essentially callback functions) while internal locks are acquired. To
> enable dynamic linker usage from these callback functions without
> self-deadlocking, those locks are recursive locks. However, recursive locks
> do not help if the callback spawns another thread to do the work, or hands
> off the work to another thread and waits until that thread signals the work
> is complete.
Exactly right.
We have had several instances of this problem in the last year where applications are doing too much in constructors or destructors, and it usually involves starting, controlling, and stopping threads. This quickly leads to deadlocks.
> This code needs to be rewritten upstream, so that no locks are acquired
> while the dynamic linker calls into user code. To my knowledge, no such
> patches exist yet, and this has not worked in any glibc release.
It is indeed a problem to call foreign functions (constructors and destructors) with locks held, and it is a long-term goal to simplify the dynamic loader so that this can be avoided while still maintaining the consistency of the loaded libraries.
I have also never seen any patches to fix this upstream. The closest was a discussion I had with Mathieu Desnoyers (lttng, liburcu) at LPC 2016 about using liburcu, and RCU in general, to break up the internal dynamic loader load lock.
As Florian points out, this has never worked, and if anything this is a new feature request.
Given that this bug requires extensive work upstream to resolve, I'm going to mark this as CLOSED/UPSTREAM.
We are going to review and track the bug upstream here:
https://sourceware.org/bugzilla/show_bug.cgi?id=15686
The solution upstream is that we have to be able to release the locks while constructors and destructors run, but this is quite hard to implement while keeping the expected semantics.
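As a rough illustration of that direction, and not a description of any glibc patch, the sketch below drops a lock around a user callback and retakes it afterwards. Everything here is an assumption made for the example: `model_lock` is a hypothetical stand-in for the loader lock and the function names are invented. It deliberately ignores the hard part, namely keeping the loader's state consistent while the lock is not held.

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the loader lock; not glibc's real lock.
static std::recursive_timed_mutex model_lock;

// Runs user code with the lock temporarily released. The caller must hold
// the lock exactly once. The lock is retaken even if the callback throws.
void call_user_code_unlocked(const std::function<void()>& callback) {
    model_lock.unlock();
    struct Relock {
        ~Relock() { model_lock.lock(); }
    } relock;
    callback();
}

// Models a "fini" that joins a service thread which needs the lock.
// Returns true if the service obtained the lock before the join completed.
bool service_progresses_when_lock_is_dropped() {
    std::lock_guard<std::recursive_timed_mutex> held(model_lock);
    bool progressed = false;
    call_user_code_unlocked([&progressed] {
        std::thread service([&progressed] {
            if (model_lock.try_lock_for(std::chrono::seconds(5))) {
                progressed = true;
                model_lock.unlock();
            }
        });
        service.join();  // no longer waits on a thread that needs our lock
    });
    return progressed;
}
```

In this toy model the join completes because the service can take the lock while the callback runs. A real implementation would also have to cope with other threads loading or unloading objects during that window.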