Bug 126111

Summary: pthread_key_create destructor function, and pthread_join don't work during shared library destructors
Product: Red Hat Enterprise Linux 3
Reporter: Noam Lampert <noaml>
Component: glibc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Docs Contact:
Priority: medium
Version: 3.0
CC: drepper.fsp, roland, yuvalk
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: 2.3.2-95.24
Doc Type: Bug Fix
Last Closed: 2004-09-10 15:53:20 EDT
Type: ---
Attachments:
  source of reproducer sample (flags: none)

Description Noam Lampert 2004-06-16 03:02:17 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; 
MyIE2; .NET CLR 1.1.4322)

Description of problem:
When a shared library, during its destructors, performs the sequence 
pthread_cancel(tid), pthread_join(tid), the debugger shows that the 
thread dies, but the destructor functions registered with 
pthread_key_create are not executed, and pthread_join() hangs 
indefinitely. It is possible that the debugger is not telling the 
complete truth and the thread is still alive.

A sample program demonstrating this is attached.
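
The attached reproducer is not included in this report. As a rough sketch of the behaviour the report expects, the following shows the same pthread_cancel/pthread_join sequence in the ordinary case, outside any shared library destructor. All names (demo_key, demo_worker, run_cancel_join_demo) are hypothetical and are not taken from the attachment. In this case the key destructor is expected to run once before pthread_join returns.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical names throughout; this is NOT the attached reproducer. */
static pthread_key_t demo_key;
static int destructor_calls = 0;

/* Runs in the exiting thread for each key whose value is non-NULL. */
static void demo_key_destructor(void *value)
{
    (void)value;
    destructor_calls++;
}

static void *demo_worker(void *arg)
{
    (void)arg;
    /* The value must be non-NULL, otherwise the destructor is skipped. */
    pthread_setspecific(demo_key, &destructor_calls);
    for (;;)
        pause();            /* pause() is a cancellation point */
    return NULL;
}

/* Returns how many times the key destructor ran. */
int run_cancel_join_demo(void)
{
    pthread_t tid;
    void *result = NULL;
    int rc;

    destructor_calls = 0;
    rc = pthread_key_create(&demo_key, demo_key_destructor);
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, demo_worker, NULL);
    assert(rc == 0);

    /* The sequence from the report: cancel, then join. */
    rc = pthread_cancel(tid);
    assert(rc == 0);
    rc = pthread_join(tid, &result);
    assert(rc == 0);
    assert(result == PTHREAD_CANCELED);

    pthread_key_delete(demo_key);
    return destructor_calls;
}
```

The bug concerns the same sequence when it is issued from a shared library destructor, which this single-file sketch does not attempt to reproduce.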

Version-Release number of selected component (if applicable):
glibc-2.3.2-95.20

How reproducible:
Always

Steps to Reproduce:
1. extract the sample shar file by executing it
2. build the program using 'gmake'
3. setenv LD_LIBRARY_PATH .
4. ./exe
    

Actual Results:  Error: key destructor not called
(and infinite hang)

Expected Results:  Success
(and process exit)

Additional info:
Comment 1 Noam Lampert 2004-06-16 03:03:21 EDT
Created attachment 101183
source of reproducer sample
Comment 2 Noam Lampert 2004-06-16 03:05:51 EDT
I forgot to mention that this is a regression.

In RHEL 3.0 Update 1 the sample works (the bug does not occur).
In RHEL 3.0 Update 2 the bug is easily reproduced.
Comment 3 Jakub Jelinek 2004-06-16 04:56:45 EDT
I wonder how this could work in U1.
The problem is:
1) shared library destructors are executed with the dl_load_lock
   held to ensure no new shared libraries are loaded during running
   of the destructors.
   This is in the initial thread
2) when a thread is to be cancelled, it uses the unwinder in libgcc_s
   to unwind through the frames, run any pthread cleanups and class
   destructors on the way up
3) the unwinder in libgcc_s uses dl_iterate_phdr interface to query
   all currently loaded shared libraries (this is executed in the
   context of the child thread)
4) dl_iterate_phdr acquires the dl_load_lock, to make sure no new
   shared library is loaded and especially that no shared library
   is unloaded while executing this function.
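   But dl_load_lock, although it is a recursive lock, is already held
   by the initial thread, so the child thread gets stuck here until
   the initial thread releases it after it is done with its destructors.
   The initial thread, however, is blocked in pthread_join waiting for
   the child, so neither thread can make progress.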
Comment 4 Yuval Kfir 2004-07-01 07:48:32 EDT
I think it worked in U1 because the regression was introduced by the 
patch glibc-dladdr-locking.patch of 2004-02-20.  Removing this patch 
fixes the problem.  I don't suppose this is the way you want to 
proceed though.
Comment 5 Jakub Jelinek 2004-09-10 15:53:20 EDT
This should be fixed in U3.
https://rhn.redhat.com/errata/RHBA-2004-384.html