From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; MyIE2; .NET CLR 1.1.4322)

Description of problem:
When a shared library, during its destructors, performs the sequence pthread_cancel(tid), pthread_join(tid), the debugger reports that the thread dies, but the destructor functions registered via pthread_key_create are not executed and pthread_join() hangs indefinitely. It is possible that the debugger is not telling the complete truth and the thread is still alive. A sample program demonstrating this is attached.

Version-Release number of selected component (if applicable):
glibc-2.3.2-95.20

How reproducible:
Always

Steps to Reproduce:
1. Extract the sample shar file by executing it
2. Build the program using 'gmake'
3. setenv LD_LIBRARY_PATH .
4. ./exe

Actual Results:
Error: key destructor not called (and infinite hang)

Expected Results:
Success (and process exit)

Additional info:
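For readers without the attachment, the cancel/join sequence from the description can be sketched as follows. This is not the attached reproducer: it runs the sequence from ordinary code rather than from a shared library destructor, so it only illustrates the expected behaviour (the key destructor runs and pthread_join() returns). All names (demo_key, demo_worker, cancel_join_runs_key_destructor) are made up for illustration.

```c
#include <pthread.h>
#include <unistd.h>

static pthread_key_t demo_key;            /* hypothetical key for this sketch */
static volatile int key_destructor_ran;   /* set to 1 by the key destructor */

/* Destructor registered through pthread_key_create(). */
static void demo_key_destructor(void *value)
{
    (void)value;
    key_destructor_ran = 1;
}

static void *demo_worker(void *arg)
{
    (void)arg;
    /* A non-NULL value is required, otherwise the destructor is skipped. */
    pthread_setspecific(demo_key, &demo_key);
    for (;;)
        pause();                          /* pause() is a cancellation point */
    return NULL;
}

/* Returns 1 if the key destructor ran, 0 if it did not, -1 on setup failure. */
int cancel_join_runs_key_destructor(void)
{
    pthread_t tid;

    key_destructor_ran = 0;
    if (pthread_key_create(&demo_key, demo_key_destructor) != 0)
        return -1;
    if (pthread_create(&tid, NULL, demo_worker, NULL) != 0)
        return -1;

    pthread_cancel(tid);                  /* same sequence as in the report */
    pthread_join(tid, NULL);

    pthread_key_delete(demo_key);
    return key_destructor_ran;
}
```

Build with -pthread. The report concerns the case where the same two calls are made while shared library destructors are running, which this sketch deliberately does not attempt.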
Created attachment 101183: source of reproducer sample
I forgot to mention that this is a regression. In RH EL 3.0 update 1 the sample works (the bug does not occur). In RH EL 3.0 update 2 the bug is easily reproduced.
I wonder how this could work in U1. The problem is:

1) Shared library destructors are executed with the dl_load_lock held, to ensure no new shared libraries are loaded while the destructors run. This happens in the initial thread.
2) When a thread is to be cancelled, it uses the unwinder in libgcc_s to unwind through the frames, running any pthread cleanups and class destructors on the way up.
3) The unwinder in libgcc_s uses the dl_iterate_phdr interface to query all currently loaded shared libraries (this is executed in the context of the child thread).
4) dl_iterate_phdr acquires the dl_load_lock, to make sure no new shared library is loaded, and especially that no shared library is unloaded, while the function executes. But dl_load_lock, although it is a recursive lock, is already held by the initial thread, so the child thread gets stuck here until the initial thread releases the lock after it is done with its destructors. The initial thread, in turn, is blocked in pthread_join() waiting for the child, so neither makes progress.
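The point in step 4, that a recursive lock only helps the thread that already owns it, can be modelled in a few lines. This is a sketch under assumptions, not glibc's code: model_load_lock is a hypothetical stand-in for the internal dl_load_lock, and the child uses pthread_mutex_trylock() instead of a blocking lock so that the demo reports the contention rather than hanging the way the real case does.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* exposes PTHREAD_MUTEX_RECURSIVE in strict -std modes */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for glibc's internal dl_load_lock. */
static pthread_mutex_t model_load_lock;

/* Child thread: models the unwinder reaching dl_iterate_phdr, which needs
   the lock. trylock is used so the sketch cannot deadlock. */
static void *child_needs_lock(void *arg)
{
    int *result = arg;

    *result = pthread_mutex_trylock(&model_load_lock);
    if (*result == 0)
        pthread_mutex_unlock(&model_load_lock);
    return NULL;
}

/* Returns the child's trylock result while the initial thread holds the lock. */
int model_lock_contention(void)
{
    pthread_mutexattr_t attr;
    pthread_t tid;
    int child_result = -1;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&model_load_lock, &attr);
    pthread_mutexattr_destroy(&attr);

    /* Initial thread: "running destructors" with the lock held. */
    pthread_mutex_lock(&model_load_lock);

    /* Re-locking from the owning thread succeeds because the lock is recursive. */
    rc = pthread_mutex_lock(&model_load_lock);
    assert(rc == 0);
    pthread_mutex_unlock(&model_load_lock);

    /* A different thread cannot take the lock, recursive or not. */
    rc = pthread_create(&tid, NULL, child_needs_lock, &child_result);
    assert(rc == 0);
    pthread_join(tid, NULL);   /* returns only because the child used trylock */

    pthread_mutex_unlock(&model_load_lock);
    pthread_mutex_destroy(&model_load_lock);
    return child_result;
}
```

Build with -pthread. Calling model_lock_contention() yields EBUSY from the child's trylock. With a blocking pthread_mutex_lock() in the child, the pthread_join() in the initial thread would never return, which matches the hang described in the report.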
I think it worked in U1 because the regression was introduced by the patch glibc-dladdr-locking.patch of 2004-02-20. Removing this patch fixes the problem. I don't suppose this is the way you want to proceed though.
This should be fixed in U3. https://rhn.redhat.com/errata/RHBA-2004-384.html