Description of problem:
When a shared library, in one of its destructors, calls
pthread_cancel(tid) followed by pthread_join(tid), the debugger shows
the thread dying, but the destructor functions registered with
pthread_key_create() are not executed and pthread_join() hangs
indefinitely. It is possible that the debugger is not telling the
whole truth and the thread is still alive.
A sample program demonstrating this is attached.
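For readers without the attachment, the sequence described above can be sketched roughly as follows. This is not the attached reproducer, only a minimal illustration; the names cancel_and_join, worker, key_destructor and worker_ready are hypothetical. Called from ordinary code it is expected to return 1 (key destructor ran). The hang reported here would need the same call to be made from a shared library destructor, for example a function marked __attribute__((destructor)).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

static pthread_key_t key;
static sem_t worker_ready;
static int key_destructor_ran;

/* Registered via pthread_key_create(); expected to run when the thread
   exits (including by cancellation) with a non-NULL value for the key. */
static void key_destructor(void *value)
{
    (void)value;
    key_destructor_ran = 1;
}

static void *worker(void *arg)
{
    (void)arg;
    /* Any non-NULL value makes the key destructor eligible to run. */
    pthread_setspecific(key, &key_destructor_ran);
    sem_post(&worker_ready);
    for (;;)
        pause();            /* pause() is a cancellation point */
    return NULL;            /* not reached */
}

/* Returns 1 if the key destructor ran before pthread_join() returned. */
int cancel_and_join(void)
{
    pthread_t tid;
    int rc;

    key_destructor_ran = 0;
    rc = sem_init(&worker_ready, 0, 0);
    assert(rc == 0);
    rc = pthread_key_create(&key, key_destructor);
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, worker, NULL);
    assert(rc == 0);

    /* Wait until the worker has stored its thread-specific value. */
    while (sem_wait(&worker_ready) == -1 && errno == EINTR)
        continue;

    rc = pthread_cancel(tid);
    assert(rc == 0);
    rc = pthread_join(tid, NULL);   /* the call reported to hang */
    assert(rc == 0);
    (void)rc;

    pthread_key_delete(key);
    sem_destroy(&worker_ready);
    return key_destructor_ran;
}
```

The semaphore only ensures the worker has set its thread-specific value before it is cancelled, so that a missing destructor call cannot be blamed on a race in the sketch itself.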
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. extract the sample shar file by executing it
2. build the program using 'gmake'
3. setenv LD_LIBRARY_PATH .
Actual Results: Error: key destructor not called
(and infinite hang)
Expected Results: Success
(and process exit)
Created attachment 101183
source of reproducer sample
I forgot to mention that this is a regression.
In RH EL 3.0 update 1 the sample works (the bug does not occur).
In RH EL 3.0 update 2 the bug is easily reproduced.
I wonder how this could work in U1.
The problem is:
1) shared library destructors are executed with dl_load_lock held,
to ensure that no new shared libraries are loaded while the
destructors are running.
This happens in the initial thread.
2) when a thread is cancelled, it uses the unwinder in libgcc_s
to unwind through its frames, running any pthread cleanup handlers
and class destructors on the way up
3) the unwinder in libgcc_s uses the dl_iterate_phdr interface to query
all currently loaded shared libraries (this is executed in the
context of the child thread)
4) dl_iterate_phdr acquires dl_load_lock, to make sure that no new
shared library is loaded, and especially that no shared library
is unloaded, while the function is executing.
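But dl_load_lock, although it is a recursive lock, is already held
by the initial thread, and recursion only helps the thread that owns
the lock. So the child thread gets stuck here until the initial
thread releases the lock after it is done with its destructors,
while the initial thread is itself blocked in pthread_join() waiting
for the child.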
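The locking behaviour behind this can be modelled in isolation. The sketch below does not touch glibc internals: model_load_lock is a hypothetical stand-in for dl_load_lock, and the child uses pthread_mutex_trylock() rather than a blocking lock so that the sketch reports the conflict instead of hanging. It only illustrates that a recursive mutex can be re-acquired by its owner but not by another thread.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for dl_load_lock: a recursive mutex. */
static pthread_mutex_t model_load_lock;

/* Child thread: models dl_iterate_phdr trying to take the lock.
   trylock is used so the attempt fails instead of blocking forever. */
static void *child(void *arg)
{
    int *result = arg;

    *result = pthread_mutex_trylock(&model_load_lock);
    if (*result == 0)
        pthread_mutex_unlock(&model_load_lock);
    return NULL;
}

/* Returns the child's trylock result while the initial thread
   holds the recursive lock. */
int child_lock_attempt_while_held(void)
{
    pthread_mutexattr_t attr;
    pthread_t tid;
    int child_result = -1;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    assert(rc == 0);
    rc = pthread_mutex_init(&model_load_lock, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    /* The owner may lock again: this is all "recursive" buys. */
    rc = pthread_mutex_lock(&model_load_lock);
    assert(rc == 0);
    rc = pthread_mutex_lock(&model_load_lock);
    assert(rc == 0);

    /* Joining while holding the lock is safe here only because the
       child uses trylock; with a blocking lock this would deadlock. */
    rc = pthread_create(&tid, NULL, child, &child_result);
    assert(rc == 0);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_mutex_unlock(&model_load_lock);
    pthread_mutex_unlock(&model_load_lock);
    pthread_mutex_destroy(&model_load_lock);
    return child_result;
}
```

In this model the child's attempt is expected to fail with EBUSY. If the child called pthread_mutex_lock() instead, it would block until the owner unlocked, and an owner waiting in pthread_join() would never do so, which is the shape of the hang described above.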
I think it worked in U1 because the regression was introduced by the
patch glibc-dladdr-locking.patch of 2004-02-20. Removing this patch
fixes the problem, though I don't suppose this is the way you want to
fix it.
This should be fixed in U3.