Bug 111548

Summary: calling pthread_cancel in a multi-threaded C++ application abort()s the app.
Product: Red Hat Enterprise Linux 3
Reporter: Dan Nuffer <redhatbugzilla>
Component: glibc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: bkoz, drepper, francois-xavier.kowalski, gav, tdevanes
Hardware: i386
OS: Linux
Last Closed: 2004-09-28 09:23:21 UTC
Attachments:
Program that demonstrates the problem. (Flags: none)

Description Dan Nuffer 2003-12-05 06:31:54 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.1)
Gecko/20031114

Description of problem:
OpenWBEM (http://openwbem.sf.net/) is a multi-threaded C++ program.
When running the unit tests for the Thread class, one of the tests
cancels a thread by calling pthread_cancel().  This normally works
just fine, but on RH 3.0 (and Fedora 1) quite often abort() will be
called after the following message has been printed to stderr:
FATAL: exception not rethrown

This happens quite often (~3/4 of the time) on a dual-CPU box,
and less often (~1/10 of the time) on a single-CPU box.

There is a section of code in the first function called from the
thread function which is essentially
catch (...)
{
}
This is to prevent any unexpected exceptions from propagating up any
further, which would cause a segfault.

The message seems to imply that the exception needs to be
rethrown.

I think this is wrong.  The new forced stack unwinding for thread
cancellation hasn't made cancellation any easier or safer to use.  It
seems as if it will /always/ abort the app.  If the exception's not
caught, then you segfault; if it is caught, then you segfault.  How do
you stop it?
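
For illustration, here is a minimal sketch of the handler shape that the "exception not rethrown" message seems to ask for. It assumes a glibc/NPTL system with libstdc++, and it uses abi::__forced_unwind from <cxxabi.h>, which is a GCC-specific type found in later GCC releases and may not exist on the toolchain in this report (there, rethrowing from catch (...) with a bare "throw;" would be the fallback). The names guarded_thread, cancel_guarded_thread, entered_try and saw_forced_unwind are hypothetical. This is not the OpenWBEM code.

```cpp
#include <cxxabi.h>     // abi::__forced_unwind (GCC-specific, assumption)
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <atomic>

// Hypothetical flags, used only to observe what happened.
static std::atomic<bool> entered_try{false};
static std::atomic<bool> saw_forced_unwind{false};

// Hypothetical thread body: swallows ordinary exceptions, but lets the
// forced unwind used for cancellation continue up the stack.
static void* guarded_thread(void*)
{
    try {
        entered_try = true;
        for (;;)
            sleep(1);               // cancellation point (deferred mode)
    } catch (abi::__forced_unwind&) {
        saw_forced_unwind = true;   // per-thread cleanup could go here
        throw;                      // must rethrow, or the runtime aborts
    } catch (...) {
        // ordinary C++ exceptions stop here, as in the original code
    }
    return nullptr;
}

// Starts the thread, cancels it once it is inside the try block, and
// reports whether it terminated through cancellation.
static bool cancel_guarded_thread()
{
    pthread_t t;
    if (pthread_create(&t, nullptr, guarded_thread, nullptr) != 0)
        return false;
    while (!entered_try)
        sched_yield();
    pthread_cancel(t);
    void* result = nullptr;
    pthread_join(t, &result);
    return result == PTHREAD_CANCELED;
}
```

Calling cancel_guarded_thread() from main() should return true without an abort on such a system. The sketch only covers deferred cancellation at a cancellation point; it says nothing about asynchronous cancellation.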


Version-Release number of selected component (if applicable):
glibc-2.3.2-101.1

How reproducible:
Sometimes

Steps to Reproduce:
1. Check out the code from OpenWBEM CVS.
2. Build it.
3. Run "make check" in the test/unit subdir
    

Actual Results:  The test aborted with the following error:
FATAL: exception not rethrown


Expected Results:  It should have finished successfully.

Additional info:
I did read the small section in the release notes about this, and
found the information rather sparse, and I couldn't find any other
information anywhere.
OpenWBEM doesn't use throw() or -fno-exceptions.
The whole thing about disabling/enabling cancellation whenever calling
a C function is completely impractical.  If cancellation can't unwind
the stack correctly while ignoring any throw() or catch(...){} code
that would impede a normal exception, then it should work as it did
before and not bother to unwind the stack.
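
As a rough sketch of what "disabling/enabling cancellation" around a call could look like with less boilerplate, the following wraps pthread_setcancelstate() in a scope guard. CancelDisabler and current_cancel_state are hypothetical names, not part of glibc or OpenWBEM, and the sketch assumes a C++11 compiler.

```cpp
#include <pthread.h>

// Hypothetical RAII helper: cancellation is disabled for the lifetime of
// the object and the previous state is restored on scope exit.
class CancelDisabler {
public:
    CancelDisabler()
    {
        pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old_state_);
    }
    ~CancelDisabler()
    {
        int ignored;
        pthread_setcancelstate(old_state_, &ignored);
    }
    CancelDisabler(const CancelDisabler&) = delete;
    CancelDisabler& operator=(const CancelDisabler&) = delete;
private:
    int old_state_;
};

// Hypothetical helper: reads the calling thread's cancel state by briefly
// setting it to DISABLE and then restoring whatever was there before.
static int current_cancel_state()
{
    int state, ignored;
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &state);
    pthread_setcancelstate(state, &ignored);
    return state;
}
```

A call into code that cannot be unwound would then be written as "{ CancelDisabler guard; some_c_function(); }", where some_c_function stands for any such call. A pending cancellation request stays pending and is acted on at the next cancellation point after the guard goes out of scope.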

Comment 1 Dan Nuffer 2003-12-05 06:38:39 UTC
Created attachment 96366 [details]
Program that demonstrates the problem.

This program will seem to work most of the time, but if you run it repeatedly,
it will eventually fail:

[dan@heather tmp]$ while ./a.out >/dev/null; do true; done
FATAL: exception not rethrown
Aborted

Comment 2 Jakub Jelinek 2003-12-11 21:32:29 UTC
Well, the std::cout << line certainly cannot come after setting
cancellation to asynchronous (see http://www.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_09.html#tag_02_09_05_04 ).
But moving it before the two pthread_* calls doesn't seem to cure the
situation, nor does adding the -fasynchronous-unwind-tables command-line option.
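
To make the ordering concrete, here is a minimal sketch (not the attached test program) in which all I/O happens while cancellation is still deferred and the asynchronous region contains nothing but a compute loop. It assumes a platform where asynchronous cancellation can unwind out of plain compiled code, such as a current glibc system; the names spinning, counter and run_and_cancel are hypothetical.

```cpp
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <iostream>

static std::atomic<bool> spinning{false};   // hypothetical "ready" flag
static volatile unsigned long counter = 0;  // keeps the loop observable

static void* the_thread(void*)
{
    // Library calls (I/O included) only while cancellation is deferred.
    std::cout << "started\n" << std::flush;
    spinning = true;

    int old_type;
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &old_type);

    // Asynchronous region: pure computation, no library functions.
    for (;;)
        counter = counter + 1;

    return nullptr;   // not reached
}

// Starts the thread, waits until its I/O is done, cancels it and
// reports whether it terminated through cancellation.
static bool run_and_cancel()
{
    pthread_t t;
    if (pthread_create(&t, nullptr, the_thread, nullptr) != 0)
        return false;
    while (!spinning)
        sched_yield();
    pthread_cancel(t);
    void* result = nullptr;
    pthread_join(t, &result);
    return result == PTHREAD_CANCELED;
}
```

The point of the layout is that nothing inside the asynchronous region can be interrupted halfway through a library call, which is the situation the backtraces in this report show.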

Comment 3 Ulrich Drepper 2003-12-11 23:15:02 UTC
The problem seems to be that the I/O code in libstdc++ calls
cancelable functions but doesn't have unwind info for the entire call
path.  This is the backtrace:

#0  unwind_cleanup (reason=_URC_FOREIGN_EXCEPTION_CAUGHT, exc=0x40b52dd0)
    at unwind.c:100
#1  0x4bd4f198 in _Unwind_DeleteException (exc=0x40b52dd0) at unwind.inc:268
#2  0x4bd203b0 in __cxa_end_catch ()
    at ../../../../libstdc++-v3/libsupc++/eh_catch.cc:117
#3  0x4bd04e77 in std::ostream::write(char const*, int) (this=0x8049de0,
    __s=0x4001ecec "U\211�WS�", __n=8) at ios_base.h:121
#4  0x4bd055bf in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*) (__out=@0x8049de0, __s=0x8048b5c "started\n")
    at ostream.tcc:651
#5  0x0804892a in the_thread(void*) () at u.cc:12

This is in the corrected version where async cancellation is only
enabled later.

We need to look at every place in libstdc++ where it calls cancelable
functions and make sure all callers of those functions (transitively)
are compiled with unwind info.


Comment 4 Ulrich Drepper 2003-12-12 00:45:33 UTC
Complete backtrace from the point the cancellation was thrown:

/usr/src/libc/obj/nptl/libpthread.so.0 [0x400236f0]
/usr/src/libc/obj/elf/ld.so [0x40000c22]
/usr/src/libc/obj/libc.so.6(__write+0x4b) [0x4010b9db]
/usr/src/libc/obj/libc.so.6(_IO_file_write+0x3f) [0x400a6e2f]
/usr/src/libc/obj/libc.so.6 [0x400a5dbe]
/usr/src/libc/obj/libc.so.6(_IO_do_write+0x36) [0x400a5d56]
/usr/src/libc/obj/libc.so.6(_IO_file_overflow+0x159) [0x400a6469]
/usr/src/libc/obj/libc.so.6(_IO_file_xsputn+0xc1) [0x400a6f51]
/usr/src/libc/obj/libc.so.6(_IO_fwrite+0x12f) [0x4009bb5f]
/usr/lib/libstdc++.so.5(_ZNSt12__basic_fileIcE6xsputnEPKci+0x38)
[0x71abb8]
/usr/lib/libstdc++.so.5(_ZNSt13basic_filebufIcSt11char_traitsIcEE22_M_convert_to_externalEPciRiS4_+0x1d1)
[0x6cd361]
/usr/lib/libstdc++.so.5(_ZNSt13basic_filebufIcSt11char_traitsIcEE18_M_really_overflowEi+0xf1)
[0x6cd0f1]
/usr/lib/libstdc++.so.5(_ZNSt13basic_filebufIcSt11char_traitsIcEE8overflowEi+0x9c)
[0x6ccffc]
/usr/lib/libstdc++.so.5(_ZNSt15basic_streambufIcSt11char_traitsIcEE6xsputnEPKci+0x94)
[0x709ee4]
/usr/lib/libstdc++.so.5(_ZNSt13basic_filebufIcSt11char_traitsIcEE6xsputnEPKci+0x38)
[0x6cd8b8]
/usr/lib/libstdc++.so.5(_ZNSo5writeEPKci+0x53) [0x6fff43]
/usr/lib/libstdc++.so.5(_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc+0xff)
[0x7006ef]
/tmp/W(_Z10the_threadPv+0x20) [0x8048c0a]
/usr/src/libc/obj/nptl/libpthread.so.0 [0x4001cc5c]
/usr/src/libc/obj/libc.so.6(__clone+0x5a) [0x401198ca]

Comment 5 Gav Wood 2004-03-16 23:16:25 UTC
i get this problem too, using a glibc 2.3.3 snapshot (dated 
2004-02-07) and (vanilla) kernel 2.6.3 with nptl enabled. 
 
this is the only reference to this bug i can find on google, but 
it's hampering my coding :-(. 
 
how to produce: 
take two threads, A & B and a mutex M and condition C: 
 
A acquires M, sleeps for a second then frees M and exits normally. 
B acquires M, then waits on C indefinitely. 
main() starts A, then B, then cancels/joins B, then cancels A, if 
still running. 
 
what should happen: 
0. both threads start 
1. A acquires M; B blocks, waiting for M to become unlocked. 
2. B is cancelled, which is deferred, blocking main(). 
3. A unlocks M. 
4. A exits; B acquires M, blocks on C indefinitely. 
5. B, having reached a cancellation point, is cancelled. 
6. Program exits. 
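
a rough sketch of the expected sequence above, assuming deferred cancellation and cleanup handlers that release M. it is not the failing program from this comment; the function names and the shortened sleep are assumptions made for illustration.

```cpp
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  C = PTHREAD_COND_INITIALIZER;

// Cleanup handler: releases the mutex if the thread is cancelled
// (or when popped with a non-zero argument).
static void unlock_mutex(void* m)
{
    pthread_mutex_unlock(static_cast<pthread_mutex_t*>(m));
}

// A: acquires M, sleeps (shortened to 100 ms here), frees M, exits.
static void* thread_a(void*)
{
    pthread_mutex_lock(&M);
    pthread_cleanup_push(unlock_mutex, &M);
    usleep(100 * 1000);          // cancellation point while holding M
    pthread_cleanup_pop(1);      // runs unlock_mutex
    return nullptr;
}

// B: acquires M, then waits on C indefinitely.
static void* thread_b(void*)
{
    pthread_mutex_lock(&M);
    pthread_cleanup_push(unlock_mutex, &M);
    for (;;)
        pthread_cond_wait(&C, &M);   // cancellation point; M is re-locked
                                     // before the cleanup handler runs
    pthread_cleanup_pop(1);          // not reached, pairs with the push
    return nullptr;
}

// main()'s part: start A, then B, cancel/join B, then cancel A.
static bool run_scenario()
{
    pthread_t a, b;
    if (pthread_create(&a, nullptr, thread_a, nullptr) != 0) return false;
    if (pthread_create(&b, nullptr, thread_b, nullptr) != 0) return false;

    void* b_result = nullptr;
    pthread_cancel(b);
    pthread_join(b, &b_result);

    pthread_cancel(a);               // harmless if A has already finished
    pthread_join(a, nullptr);

    // With both threads gone, M should be free again.
    bool mutex_free = (pthread_mutex_trylock(&M) == 0);
    if (mutex_free)
        pthread_mutex_unlock(&M);
    return b_result == PTHREAD_CANCELED && mutex_free;
}
```

run_scenario() returning true corresponds to steps 0-6 completing without an abort.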
 
what actually happens: 
0-3. Correct. 
4. Error given immediately after A's "main" function has exited: 
"FATAL: exception not rethrown". Program immediately aborts. 
 
i'm having trouble getting gdb to function correctly, so i can't 
really give an accurate backtrace, but it appears to be much the 
same problem described above. 
 
is there any news on a workaround/fix? 
 
cheers, 
 
gav 

Comment 6 Jakub Jelinek 2004-03-16 23:30:12 UTC
Gav, if you don't use PTHREAD_CANCEL_ASYNCHRONOUS, it is unrelated
and you should file a new bugreport instead of appending to an unrelated
one.  Can you come up with a simple testcase which you can reproduce
things on?

Comment 7 Gav Wood 2004-03-17 00:18:41 UTC
done - bug number is #118490. 
gav 

Comment 8 Ulrich Drepper 2004-09-28 09:23:21 UTC
I'm closing the bug.  The original poster never got back and everything
points to using library functions while async cancel mode is enabled.
This is always, 100% of the time, forbidden.

Comment 9 Raj Devanesan 2005-04-15 06:56:44 UTC
Hi, at Lehman we are porting our machines to Red Hat AS 3.0. When I try to port my 
code from the previous version to AS 3.0, I have the same problem. I could not 
come up with any decent solution to address this issue. If the thread code is 
calling a "non-yielding" method from a vendor library, there is no way we can 
exit the thread other than calling pthread_cancel(). 

Modifying the non-yielding vendor API to be yielding (giving up control) is not 
possible. Is there any way I can make use of pthread_exit(), or any other 
method, to solve this problem? My temporary solution is to set 
LD_ASSUME_KERNEL=2.49.., which makes use of LinuxThreads.