Created attachment 1027321 [details]
process stack trace

Attached is a trace of the clinfo -> ocl-icd -> mesa-libOpenCL dlopen deadlock with glibc > 2.21.90-7:

glibc-2.21.90-13.fc23.x86_64
clinfo-0.1-0.6.git20150215.94fdb47.fc23.x86_64
ocl-icd-2.2.4-1.git20150518.7c94f4a.fc23.x86_64
mesa-libOpenCL-10.6.0-0.devel.6.5a55f68.fc23.x86_64
Created attachment 1027326 [details]
clinfo output

Attached is the stdout+stderr of clinfo with glibc-2.21.91-7.
Note that, initially, ocl-icd called dlopen in its constructor. This is *not* the case anymore: ocl-icd now only calls dlopen from a plain exported function, so I do not think there is any "undefined behavior" left in ocl-icd. But the library that is dlopen-ed itself calls thread-related functions in its constructor (at least pthread_join()). I do not know whether this is allowed by the standard (as a parallel programmer, I would expect it to be, so that a runtime can create a pool of threads at initialization time). If it is not, I would be very interested in a list of functions that can safely be called in a library constructor.
I do not know the libc code but, from my limited analysis of this bug, I am under the impression that dlopen() holds a TLS-related lock during the whole time the library constructor is executing. If this is really the case, I find it strange.
Siddhesh, this bug is for you. It is a combination of C++ thread_local destructors using the new registration interface in glibc (__cxa_thread_atexit_impl) and the example application calling dlopen to load a library that uses threads. Let me explain in a bit more detail:

(a) libMesaOpenCL.so.1 uses both C++ thread_local destructors and threads.

(b) The application calls a function which dlopen's libMesaOpenCL. At this point the dl_load_lock is being held by the currently executing dlopen.

(c) libMesaOpenCL.so.1 relies on libpthread and llvmpipe, which briefly creates threads, destroys them, and then waits on them.

(d) As libMesaOpenCL.so.1 waits on the threads, those threads in turn must run their destructors and update the DSO state.

Here is the problem: the exiting threads can't touch the DSO state, because they are created and destroyed while the load lock is held. This code in glibc/stdlib/cxa_thread_atexit_impl.c:

 88       cur->func (cur->obj);
 89
 90       __rtld_lock_lock_recursive (GL(dl_load_lock));
 91
 92       /* Allow DSO unload if count drops to zero.  */
 93       cur->map->l_tls_dtor_count--;
 94       if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
 95         cur->map->l_flags_1 &= ~DF_1_NODELETE;
 96
 97       __rtld_lock_unlock_recursive (GL(dl_load_lock));
 98
 99       free (cur);

is going to deadlock, with all the threads waiting on the lock and the parent waiting for all the threads.

It seems to me that running destructors cannot wait for the load lock, because those destructors might be called while the load lock is held during dlopen. The load lock is recursive, but only within a single thread (it allows one thread to call dlopen recursively); here the main thread holds the lock while the created threads wait on it.

I think the removal of DF_1_NODELETE needs to be deferred until dlclose() time. The execution of the destructors can run in the correct thread at thread exit, but the modification of l_flags_1 must be deferred.
The thread knows that the DSO in its list has a non-zero l_tls_dtor_count. Therefore the DSO must still have DF_1_NODELETE set and can't be removed, so it is technically safe to carry out a synchronized atomic decrement of l_tls_dtor_count (synchronized against the increment via acquire/release as appropriate). However, as soon as you decrement, the DSO might be removed, and thus manipulating cur->map->l_flags_1 is impossible.

The only solution I see is to delay checking l_tls_dtor_count until dlclose for that DSO, at which point you check whether l_tls_dtor_count is zero and then remove DF_1_NODELETE and remove the DSO. The immediate problem is that I don't know how you easily distinguish between "thread_local destructors set DF_1_NODELETE" and "something else set DF_1_NODELETE." You need to know the difference if you're going to remove the NODELETE marker. I think it's easy enough to allocate one bit in l_tls_dtor_count to act as a "we set the flag" bit.

Either way, we need an immediate fix, and the easiest fix is to not unload the DSO at all, leaving it loaded if there are any thread_local destructors.
I see it now - Jakub had suggested not unloading these DSOs at all and I've been wondering if that is in fact the best option. I can atomically decrement l_tls_dtor_count and decide on whether to unload during dlclose. I'll explore doing the latter.
(In reply to Siddhesh Poyarekar from comment #5)
> I see it now - Jakub had suggested not unloading these DSOs at all and I've
> been wondering if that is in fact the best option. I can atomically
> decrement l_tls_dtor_count and decide on whether to unload during dlclose.
> I'll explore doing the latter.

Exactly, it becomes a QoI issue. You can fix the bug by simply not unloading the DSO, and then we can consider the complete solution.
This is much more complicated. The rtld lock may be taken at any time within the constructor, including in the following cases:

1. The constructor instantiates a thread_local variable, i.e. registers a destructor using __cxa_thread_atexit.

2. The constructor accesses a thread-local variable (either __thread or thread_local), which results in the TLS descriptor code taking the rtld lock.

Funnily, I have not seen the deadlock in the first case, i.e. via the __cxa_thread_atexit code path. Even the clinfo deadlock is the TLS descriptor code taking the rtld lock, due to tls_dtor_list being a TLS variable. Also, libMesaOpenCL.so.1 is not really using thread_local destructors; __call_tls_dtors gets called all the time and is a no-op if there are no destructors. So this could impact more than just mesa. In fact, it is possible to get a similar hang with glibc-2.21.90-7.fc23:

main.c:

#include <stdio.h>
#include <dlfcn.h>
#include <assert.h>

int
main (void)
{
  void *h = dlopen ("./mod1.so", RTLD_NOW | RTLD_GLOBAL);
  assert (h != NULL);
  return 0;
}

mod1.cc:

#include <pthread.h>

class A
{
private:
  int i;

public:
  A () { i = 0; }
  ~A () { i = 42; }
  void hello (void) {}
};

thread_local A a;

static void *
thr (void *u)
{
  a.hello ();
  return NULL;
}

void __attribute__((constructor))
init (void)
{
  pthread_t t;
  pthread_create (&t, NULL, thr, NULL);
  pthread_join (t, NULL);
}

Build mod1.cc with:

g++ -std=c++11 -shared -fPIC -g -o mod1.so mod1.cc -pthread

Latest master hangs even if mod1.cc just creates and waits for a thread, with no thread_local variables required. Two things need to be done:

1. Avoid taking the rtld lock in the thread_local destructor code, and
2. Avoid taking the rtld lock in TLS.

Let me see how we can manage to do that.
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle. Changing version to '23'. (As we did not run this process for some time, it could affect also pre-Fedora 23 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.) More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23
This should already be fixed in F23 and rawhide. Please test and reopen if the problem still occurs.