1409899 – glibc: [RFE] Deadlock with dlclose call



Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	glibc
Sub Component:
Version:	8.2
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	8.2
Assignee:	glibc team
QA Contact:	qe-baseos-tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1420851 1477664

Reported:	2017-01-03 19:35 UTC by Paulo Andrade
Modified:	2023-09-07 18:49 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-03-02 15:44:25 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
dllock_bug.tar.gz (2.63 KB, application/x-gzip) 2017-01-03 19:35 UTC, Paulo Andrade

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Sourceware	15686	0	P2	NEW	Shared-object static constructors called with a lock held	2022-11-08 19:00:20 UTC
Sourceware	19448	0	P2	RESOLVED	deadlock in dlopen when ctor calls dlopen in another thread	2022-11-08 19:00:22 UTC

Description Paulo Andrade 2017-01-03 19:35:03 UTC

Created attachment 1236965
dllock_bug.tar.gz

The reproducer is based on the attachment to
https://sourceware.org/bugzilla/show_bug.cgi?id=2377

  The program runs like this:

1.  A shared object is dlopen'ed.
2.  A function from the shared object is run.
3.  This function creates a thread to run a "service".
4.  The shared object is dlclose'd.
5.  The shared object has a "fini" method.
6.  The "fini" method calls pthread_join.
7.  The service receives a C++ exception.
8.  The service handles the exception and exits.

  The problem is that it blocks at step 7.
  This happens because dlclose runs "destructors" with dl_load_lock
held, and the C++ exception handling calls tls_get_addr_tail(), which
deadlocks attempting to acquire dl_load_lock.
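
The blocking described above can be modelled outside glibc. The following is a Python sketch, not glibc code: a threading.RLock stands in for dl_load_lock, an Event stands in for the C++ exception reaching the service, and a join timeout replaces the infinite wait so that the script terminates. The name run_scenario and its parameter are made up for illustration.

```python
import threading


def run_scenario(join_timeout=0.5):
    """Return True if the modelled 'fini' would have hung in pthread_join."""
    load_lock = threading.RLock()   # stand-in for dl_load_lock (assumption)
    stop = threading.Event()        # stand-in for the exception in step 7

    def service():
        stop.wait()                 # the service runs until it is told to stop
        with load_lock:             # models tls_get_addr_tail() taking the lock
            pass                    # handle the exception and exit (step 8)

    worker = threading.Thread(target=service)
    worker.start()                  # step 3: the thread is created

    with load_lock:                 # step 4: 'dlclose' takes the lock...
        stop.set()                  # step 7: the service gets its exception
        worker.join(timeout=join_timeout)   # step 6: 'fini' joins the thread
        blocked = worker.is_alive() # still alive: it is stuck on load_lock

    worker.join()                   # lock released, the model thread can finish
    return blocked


if __name__ == "__main__":
    print(run_scenario())
```

In this model the join always times out, because the service thread cannot take the lock until the "dlclose" side releases it, and that side is waiting for the thread. In the real program there is no timeout, so both sides wait forever.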

  This should be a variant of
https://sourceware.org/git/?p=glibc.git;a=commit;h=e400f3ccd36fe91d432cc7d45b4ccc799dece763
A temporary "workaround" could be to "fix" it in libstdc++,
but the issue would still happen if one has a static TLS variable
that is first accessed in the "destructor", after dlclose()
is called. A proper correction would be to have a second
mutex, for the use of tls_get_addr_tail() and others, which
dlclose() would also acquire after running the destructors.

Comment 3 Florian Weimer 2017-06-12 16:09:44 UTC

It is my understanding that this issue arises from a design issue in the dynamic linker.  The dynamic linker invokes ELF constructors and destructors (essentially callback functions) while internal locks are acquired.  To enable dynamic linker usage from these callback functions without self-deadlocking, those locks are recursive locks.  However, recursive locks do not help if the callback spawns another thread to do the work, or hands off the work to another thread and waits until that thread signals the work is complete.

This code needs to be rewritten upstream, so that no locks are acquired while the dynamic linker calls into user code.  To my knowledge, no such patches exist yet, and this has not worked in any glibc release.
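
The point about recursive locks can be checked with a small model. This is a Python sketch with threading.RLock standing in for the loader's recursive locks, not the glibc implementation. The names probe_recursive_lock and helper are hypothetical.

```python
import threading


def probe_recursive_lock():
    """Try to re-acquire a held recursive lock from two different threads."""
    lock = threading.RLock()        # recursive lock (model of the loader lock)
    results = {}

    def helper():
        # A different thread does not own the lock, so recursion does not help.
        got = lock.acquire(blocking=False)
        results["helper_thread"] = got
        if got:
            lock.release()

    with lock:                      # the "loader" holds the lock, runs a callback
        # The owning thread may take the lock again: this is what recursion buys.
        got = lock.acquire(blocking=False)
        results["same_thread"] = got
        if got:
            lock.release()

        # The callback hands work to another thread and waits for it.
        # This join returns only because helper uses a non-blocking attempt.
        t = threading.Thread(target=helper)
        t.start()
        t.join()

    return results


if __name__ == "__main__":
    print(probe_recursive_lock())
```

The owning thread succeeds and the helper thread fails. If the helper used a blocking acquire, as real code would, the join inside the callback would never return.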

Comment 4 Carlos O'Donell 2017-06-13 19:46:12 UTC

(In reply to Florian Weimer from comment #3)
> It is my understanding that this issue arises from a design issue in the
> dynamic linker.  The dynamic linker invokes ELF constructors and destructors
> (essentially callback functions) while internal locks are acquired.  To
> enable dynamic linker usage from these callback functions without
> self-deadlocking, those locks are recursive locks.  However, recursive locks
> do not help if the callback spawns another thread to do the work, or hands
> off the work to another thread and waits until that thread signals the work
> is complete.

Exactly right.

We have had several instances of this problem in the last year where applications are doing too much in constructors or destructors, and it usually involves starting, controlling, and stopping threads. This quickly leads to deadlocks.
 
> This code needs to be rewritten upstream, so that no locks are acquired
> while the dynamic linker calls into user code.  To my knowledge, no such
> patches exist yet, and this has not worked in any glibc release.

It is indeed a problem to call foreign functions (constructors and destructors) with locks held, and it is a long-term goal to simplify this in the dynamic loader to attempt to make it possible while still maintaining the consistency of the loaded libraries.

I have also never seen any patches to fix this upstream. The closest was a discussion I had with Mathieu Desnoyers (lttng, liburcu) at LPC 2016 where I discussed the use of liburcu and RCU in general to break the internal dynamic loader load lock.

As Florian points out, this has never worked, and if anything this is a new feature request.

Comment 5 Chris Williams 2017-06-22 18:29:33 UTC

Changing this to an RFE based on comment 4.

Comment 9 Carlos O'Donell 2020-03-02 15:44:25 UTC

Given that this bug requires extensive work upstream to resolve I'm going to mark this as CLOSED/UPSTREAM.

We are going to review and track the bug upstream here:
https://sourceware.org/bugzilla/show_bug.cgi?id=15686

The solution upstream is that we have to be able to release the locks while constructors run, but this is quite hard to implement while keeping the expected semantics.
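
The idea of releasing the locks around user code can be modelled as follows. This is a Python sketch under simplifying assumptions, not a proposed glibc patch. It ignores the hard part mentioned above, namely keeping the set of loaded libraries consistent while the lock is dropped. The name run_with_lock_released and the three-phase split are hypothetical.

```python
import threading


def run_with_lock_released(join_timeout=5.0):
    """Return True if the modelled 'fini' manages to join its thread."""
    load_lock = threading.RLock()   # stand-in for the loader lock (assumption)
    stop = threading.Event()

    def service():
        stop.wait()
        with load_lock:             # the thread needs the lock on its way out
            pass

    worker = threading.Thread(target=service)
    worker.start()

    def fini():
        stop.set()
        worker.join(timeout=join_timeout)
        return not worker.is_alive()

    with load_lock:
        pass                        # phase 1: loader bookkeeping under the lock
    joined = fini()                 # phase 2: user code runs with no lock held
    with load_lock:
        pass                        # phase 3: re-take the lock to finish up
    return joined


if __name__ == "__main__":
    print(run_with_lock_released())
```

Because the lock is not held during phase 2, the service thread can acquire it, exit, and be joined.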
