Created attachment 1027321 [details]
process stack trace

Attached is a trace of the clinfo -> ocl-icd -> mesa-libOpenCL dlopen deadlock with glibc > 2.21.90-7:

glibc-2.21.90-13.fc23.x86_64
clinfo-0.1-0.6.git20150215.94fdb47.fc23.x86_64
ocl-icd-2.2.4-1.git20150518.7c94f4a.fc23.x86_64
mesa-libOpenCL-10.6.0-0.devel.6.5a55f68.fc23.x86_64
Created attachment 1027326 [details]
clinfo output

Attached is the stdout+stderr of clinfo with glibc-2.21.91-7.
Note that, initially, ocl-icd called dlopen in its constructor. This is *not* the case anymore: ocl-icd now only calls dlopen from a plain exported function, so I do not think there is any "undefined behavior" left in ocl-icd. But the library that is dlopen-ed itself calls thread-related functions in its constructor (at least pthread_join()). I do not know whether this is allowed by the standard (as a parallel programmer, I would expect it to be, so that a runtime can create a pool of threads at initialization time). If it is not, I would be very interested in a list of functions that can safely be called in a library constructor.
I do not know the libc code but, from my limited analysis of this bug, I am under the impression that dlopen() holds a TLS-related lock during the whole time the library constructor is executing. If this is really the case, I find it strange.
Siddhesh, this bug is for you. It is a combination of C++ thread_local destructors using the new registration interface in glibc (__cxa_thread_atexit_impl) and the example application calling dlopen to load a library that uses threads. Let me explain in a bit more detail:

(a) libMesaOpenCL.so.1 uses both C++ thread_local destructors and threads.

(b) The application calls a function which dlopen's libMesaOpenCL. At this point the dl_load_lock is being held by the currently executing dlopen.

(c) libMesaOpenCL.so.1 relies on libpthread and llvmpipe, which briefly creates threads, destroys them, and then waits on them.

(d) As libMesaOpenCL.so.1 waits on the threads, those threads in turn must run their destructors and update the DSO state.

Here is the problem: the exiting threads can't touch the DSO state, because they are created and destroyed while the load lock is held. This code in glibc/stdlib/cxa_thread_atexit_impl.c:

 88       cur->func (cur->obj);
 89
 90       __rtld_lock_lock_recursive (GL(dl_load_lock));
 91
 92       /* Allow DSO unload if count drops to zero.  */
 93       cur->map->l_tls_dtor_count--;
 94       if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
 95         cur->map->l_flags_1 &= ~DF_1_NODELETE;
 96
 97       __rtld_lock_unlock_recursive (GL(dl_load_lock));
 98
 99       free (cur);

is going to deadlock, with all the threads waiting on the lock and the parent waiting for all the threads.

It seems to me that running destructors cannot wait for the load lock, because those destructors might be called while the load lock is held during dlopen. The load lock is recursive, but only within a single thread (it allows one thread to call dlopen recursively); here the main thread holds the lock while the created threads wait on it.

I think the removal of DF_1_NODELETE needs to be deferred until dlclose() time. The execution of the destructors can run in the correct thread at thread exit, but the modification of l_flags_1 must be deferred.
The thread knows that the DSO in its list has a non-zero l_tls_dtor_count. Therefore the DSO must still have DF_1_NODELETE set and can't be removed, so it is technically safe to carry out a synchronized atomic decrement of l_tls_dtor_count (synchronized against the increment via acquire/release as appropriate). However, as soon as you decrement, the DSO might be removed, and thus manipulating cur->map->l_flags_1 is impossible.

The only solution I see is to delay checking l_tls_dtor_count until dlclose for that DSO, at which point you check whether l_tls_dtor_count is zero and then remove DF_1_NODELETE and remove the DSO. The immediate problem is that I don't know how you easily distinguish between "thread_local destructors set DF_1_NODELETE" and "something else set DF_1_NODELETE." You need to know the difference if you're going to remove the NODELETE marker. I think it's easy enough to allocate one bit in l_tls_dtor_count to act as a "we set the flag" bit.

Either way, we need an immediate fix, and the easiest fix is to not unload the DSO at all, leaving it loaded if there are any thread_local destructors.
I see it now - Jakub had suggested not unloading these DSOs at all and I've been wondering if that is in fact the best option. I can atomically decrement l_tls_dtor_count and decide on whether to unload during dlclose. I'll explore doing the latter.
(In reply to Siddhesh Poyarekar from comment #5)
> I see it now - Jakub had suggested not unloading these DSOs at all and I've
> been wondering if that is in fact the best option. I can atomically
> decrement l_tls_dtor_count and decide on whether to unload during dlclose.
> I'll explore doing the latter.

Exactly, it becomes a QoI issue. You can fix the bug by simply not unloading the DSO, and then we can consider the complete solution.
This is much more complicated. The rtld lock may be taken at any time within the constructor, including in the following cases:

1. The constructor instantiates a thread_local variable, i.e. registers a destructor using __cxa_thread_atexit.

2. The constructor accesses a thread-local variable (either __thread or thread_local), which results in the TLS descriptor code taking the rtld lock.

Funnily, I have not seen the deadlock in the first case, i.e. via the __cxa_thread_atexit code path. Even the clinfo deadlock is the TLS descriptor code taking the rtld lock, due to tls_dtor_list being a TLS variable. Also, libMesaOpenCL.so.1 is not really using thread_local destructors; __call_tls_dtors gets called all the time and is a no-op if there are no destructors. So this could impact more than just mesa. In fact, it is possible to get a similar hang with glibc-2.21.90-7.fc23:

main.c:

#include <stdio.h>
#include <dlfcn.h>
#include <assert.h>

int
main (void)
{
  void *h = dlopen ("./mod1.so", RTLD_NOW | RTLD_GLOBAL);
  assert (h != NULL);
  return 0;
}

mod1.cc:

#include <pthread.h>

class A
{
private:
  int i;

public:
  A () { i = 0; }
  ~A () { i = 42; }
  void hello (void) {}
};

thread_local A a;

static void *
thr (void *u)
{
  a.hello ();
  return NULL;
}

void __attribute__((constructor))
init (void)
{
  pthread_t t;
  pthread_create (&t, NULL, thr, NULL);
  pthread_join (t, NULL);
}

Build mod1.cc with:

g++ -std=c++11 -shared -fPIC -g -o mod1.so mod1.cc -pthread

Latest master hangs even if mod1.cc just creates and waits for a thread, with no thread_local variables required. Two things need to be done:

1. Avoid taking the rtld lock in the thread_local destructor code, and
2. Avoid taking the rtld lock in TLS.

Let me see how we can manage to do that.
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle. Changing version to '23'. (As we did not run this process for some time, it could affect also pre-Fedora 23 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.) More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23
This should already be fixed in F23 and rawhide. Please test and reopen if the problem still occurs.