Created attachment 1236965
dllock_bug.tar.gz

The reproducer is based on the attachment to https://sourceware.org/bugzilla/show_bug.cgi?id=2377

The program runs like this:

1. A shared object is dlopen'ed.
2. A function from the shared object is run.
3. This function creates a thread to run a "service".
4. The shared object is dlclose'd.
5. The shared object has a "fini" method.
6. The "fini" method calls pthread_join.
7. The service receives a C++ exception.
8. The service handles the exception and exits.

The problem is that it blocks at step 7. This happens because dlclose runs "destructors" with dl_load_lock acquired, and the C++ exception handling calls tls_get_addr_tail(), which deadlocks attempting to acquire dl_load_lock.

This should be a variant of https://sourceware.org/git/?p=glibc.git;a=commit;h=e400f3ccd36fe91d432cc7d45b4ccc799dece763

A temporary "workaround" could be to "fix" it in libstdc++, but the issue would still happen if a static TLS variable is first accessed in the "destructor", after dlclose() is called. A proper correction would be to have a second mutex for the use of tls_get_addr_tail() and others, which dlclose() would also acquire after running the destructors.
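For readers without the attachment, the shared-object side of steps 2 to 6 can be sketched roughly as below. This is not the code from dllock_bug.tar.gz: the names plugin_start, plugin_fini, plugin_result and per_thread_state are hypothetical, and a plain thread-local variable stands in for the TLS that the C++ exception machinery touches. Built into a shared object and unloaded with dlclose() while the service is still starting, this pattern would be expected to hit the same lock on an affected glibc.

```c
#include <pthread.h>

/* Hypothetical sketch; all names are made up for illustration. */

static pthread_t service_thread;
static int service_started;   /* nonzero while a joinable thread exists */
static int service_result;    /* written by the service, read after join */

/* Stand-in for the TLS used by C++ exception handling.  In a dlopen'ed
   object, the first access from a new thread may require the dynamic
   linker to allocate the TLS block for that thread. */
static __thread int per_thread_state;

static void *service_main(void *arg)
{
    (void)arg;
    per_thread_state = 42;            /* first TLS access in this thread */
    service_result = per_thread_state;
    return NULL;
}

/* Steps 2-3: the host calls this after dlopen(); it starts the service. */
int plugin_start(void)
{
    int rc;

    if (service_started)
        return 0;
    rc = pthread_create(&service_thread, NULL, service_main, NULL);
    if (rc == 0)
        service_started = 1;
    return rc;
}

/* Steps 5-6: the "fini" method.  dlclose() runs it with the loader lock
   held, so the join waits on a thread that may itself need that lock. */
__attribute__((destructor))
void plugin_fini(void)
{
    if (!service_started)
        return;
    pthread_join(service_thread, NULL);
    service_started = 0;
}

/* Only meaningful after plugin_fini() has joined the service. */
int plugin_result(void)
{
    return service_result;
}
```

Called directly from a normal executable (no dlopen/dlclose involved), plugin_start() followed by plugin_fini() completes without blocking, which is why the problem only shows up when the destructor runs from dlclose().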
It is my understanding that this issue arises from a design issue in the dynamic linker. The dynamic linker invokes ELF constructors and destructors (essentially callback functions) while internal locks are acquired. To enable dynamic linker usage from these callback functions without self-deadlocking, those locks are recursive locks. However, recursive locks do not help if the callback spawns another thread to do the work, or hands off the work to another thread and waits until that thread signals the work is complete. This code needs to be rewritten upstream, so that no locks are acquired while the dynamic linker calls into user code. To my knowledge, no such patches exist yet, and this has not worked in any glibc release.
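The point about recursive locks can be illustrated outside the dynamic linker with plain pthreads. In this minimal sketch, load_lock is a hypothetical stand-in for the loader lock (it is not glibc's dl_load_lock), and the worker uses pthread_mutex_trylock so that the demonstration reports the conflict instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the loader lock; not glibc's dl_load_lock. */
static pthread_mutex_t load_lock;

struct lock_demo {
    int relock_rc;   /* result of re-locking from the owning thread */
    int worker_rc;   /* result of trylock from a second thread */
};

/* Plays the role of the "service" thread that needs the loader lock. */
static void *worker(void *out)
{
    int rc = pthread_mutex_trylock(&load_lock);

    if (rc == 0)
        pthread_mutex_unlock(&load_lock);
    *(int *)out = rc;
    return NULL;
}

struct lock_demo run_lock_demo(void)
{
    struct lock_demo r = { -1, -1 };
    pthread_mutexattr_t attr;
    pthread_t t;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&load_lock, &attr);
    pthread_mutexattr_destroy(&attr);

    /* The "dlclose" side takes the lock before calling user code. */
    pthread_mutex_lock(&load_lock);

    /* Same thread, recursive lock: re-entering works. */
    r.relock_rc = pthread_mutex_trylock(&load_lock);

    /* Another thread: the lock is simply busy.  With a blocking lock in
       the worker, this join would never return. */
    if (pthread_create(&t, NULL, worker, &r.worker_rc) == 0)
        pthread_join(t, NULL);

    if (r.relock_rc == 0)
        pthread_mutex_unlock(&load_lock);
    pthread_mutex_unlock(&load_lock);
    pthread_mutex_destroy(&load_lock);
    return r;
}
```

The owning thread re-acquires the lock without trouble, while the second thread gets EBUSY. Replacing the trylock in worker with pthread_mutex_lock turns this into the hang described in this bug.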
(In reply to Florian Weimer from comment #3)
> It is my understanding that this issue arises from a design issue in the
> dynamic linker. The dynamic linker invokes ELF constructors and destructors
> (essentially callback functions) while internal locks are acquired. To
> enable dynamic linker usage from these callback functions without
> self-deadlocking, those locks are recursive locks. However, recursive locks
> do not help if the callback spawns another thread to do the work, or hands
> off the work to another thread and waits until that thread signals the work
> is complete.

Exactly right. We have had several instances of this problem in the last year where applications are doing too much in constructors or destructors, and it usually involves starting, controlling, and stopping threads. This quickly leads to deadlocks.

> This code needs to be rewritten upstream, so that no locks are acquired
> while the dynamic linker calls into user code. To my knowledge, no such
> patches exist yet, and this has not worked in any glibc release.

It is indeed a problem to call foreign functions (constructors and destructors) with locks held, and it is a long-term goal to simplify the dynamic loader so that this can be avoided while still maintaining the consistency of the loaded libraries.

I have also never seen any patches to fix this upstream. The closest was a discussion I had with Mathieu Desnoyers (lttng, liburcu) at LPC 2016, where we discussed using liburcu, and RCU in general, to break up the internal dynamic loader load lock.

As Florian points out, this has never worked, and if anything this is a new feature request.
Changing this to an RFE based on comment 4.
Given that this bug requires extensive work upstream to resolve, I'm going to mark this as CLOSED/UPSTREAM.

We are going to review and track the bug upstream here: https://sourceware.org/bugzilla/show_bug.cgi?id=15686

The solution upstream is that we have to be able to release the locks while constructors and destructors run, but this is quite hard to implement while keeping the expected semantics.
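As a rough illustration of what "release the locks while user code runs" means in general, here is a small sketch of the pattern on a toy callback registry. It is not glibc code and not a proposed patch: registry_lock, registry_add, registry_run_and_clear and joining_callback are all hypothetical names. The callbacks are copied out under the lock and invoked only after the lock has been dropped, so a callback that hands work to another thread and waits for it can complete.

```c
#include <pthread.h>
#include <stddef.h>

/* Toy registry; every name here is hypothetical, none of it is glibc. */
#define MAX_CALLBACKS 8
typedef void (*fini_fn)(void);

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static fini_fn registry[MAX_CALLBACKS];
static size_t registry_len;

int registry_add(fini_fn fn)
{
    int rc = -1;

    pthread_mutex_lock(&registry_lock);
    if (registry_len < MAX_CALLBACKS) {
        registry[registry_len++] = fn;
        rc = 0;
    }
    pthread_mutex_unlock(&registry_lock);
    return rc;
}

size_t registry_size(void)
{
    size_t n;

    pthread_mutex_lock(&registry_lock);
    n = registry_len;
    pthread_mutex_unlock(&registry_lock);
    return n;
}

static void noop(void)
{
}

/* Runs in a second thread and needs the registry lock. */
static void *helper(void *arg)
{
    (void)arg;
    registry_add(noop);
    return NULL;
}

/* Destructor-like callback: hands work to another thread and waits. */
void joining_callback(void)
{
    pthread_t t;

    if (pthread_create(&t, NULL, helper, NULL) == 0)
        pthread_join(t, NULL);
}

/* Snapshot the callbacks under the lock, then call them unlocked. */
size_t registry_run_and_clear(void)
{
    fini_fn snapshot[MAX_CALLBACKS];
    size_t n, i;

    pthread_mutex_lock(&registry_lock);
    n = registry_len;
    for (i = 0; i < n; i++)
        snapshot[i] = registry[i];
    registry_len = 0;
    pthread_mutex_unlock(&registry_lock);

    for (i = n; i-- > 0;)        /* reverse registration order */
        snapshot[i]();
    return n;
}
```

The sketch also hints at why this is hard in the real loader: once the lock is dropped, other threads can change the registry while callbacks are still running (here the helper adds an entry mid-run), so the snapshot no longer reflects the current state.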