161061 – program segfaults when built with gcc 4.0.0

Summary: program segfaults when built with gcc 4.0.0

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	gcc
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	162160

Reported:	2005-06-20 11:37 UTC by Michael Schwendt
Modified:	2007-11-30 22:11 UTC
CC List:	1 user
Fixed In Version:	4.0.1-3
Clone Of:
Environment:
Last Closed:	2005-07-15 11:25:31 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
wesnoth-strace.txt.gz (42.39 KB, application/octet-stream) 2005-07-12 10:01 UTC, Michael Schwendt
wesnoth-strace-f.txt.gz (71 K) (70.13 KB, application/octet-stream) 2005-07-12 10:48 UTC, Michael Schwendt
tiny testcase (1.12 KB, text/plain) 2005-07-14 13:42 UTC, Michael Schwendt

Links
System	ID	Private	Priority	Status	Summary	Last Updated
GNU Compiler Collection	22309	0	None	None	None	Never

Description Michael Schwendt 2005-06-20 11:37:00 UTC

This is not reproducible with FC3 and gcc-c++-3.4.3-22.fc3. Upstream developers
have not heard any complaints so far either. Their software is built for
Windows, MacOS and Linux regularly. I cannot rule out that GCC miscompiles
something here. Hence the report (upstream has got a report earlier).

[...]

"wesnoth" 0.9.2 from Fedora Extras devel tree crashes when built for FC4
(gcc-c++-4.0.0-8) or Rawhide (gcc-c++-4.0.0-12):

Program received signal SIGSEGV, Segmentation fault.
  [Switching to Thread 54066096 (zombie)]
  0x00a353a8 in ?? ()
  (gdb) bt
  #0  0x00a353a8 in ?? ()
  #1  0x0026ab7a in __nptl_deallocate_tsd () from /lib/libpthread.so.0
  #2  0x0026bb8e in start_thread () from /lib/libpthread.so.0
  #3  0x00c79dee in clone () from /lib/libc.so.6
  (gdb) t
  [Current thread is 4 (Thread 54066096 (zombie))]

Valgrind says:

==9255== 
==9255== Thread 4:
==9255== Jump to the invalid address stated on the next line
==9255==    at 0x1BAC23A8: ???
==9255==    by 0x26BB8D: start_thread (in /lib/libpthread-2.3.5.so)
==9255==    by 0xC79DED: clone (in /lib/libc-2.3.5.so)
==9255==  Address 0x1BAC23A8 is not stack'd, malloc'd or (recently) free'd
==9255== 
==9255== ---- Attach to debugger ? --- [Return/N/n/Y/y/C/c] ---- 

The other three threads are in __kernel_vsyscall. The thread functions are global.

Steps to reproduce:

1) build packages from Fedora Extras CVS devel/wesnoth
2) install "wesnoth"
3) start "wesnoth"
4) enter multiplayer menu and connect with official server

[...]

Building without optimisations makes no difference.

The program uses SDL as a pthread wrapper and SDL_net. Rebuilding SDL without
optflags doesn't change anything either. Simple SDL thread and SDL network test
code doesn't crash, so it is not an obvious fault in SDL.  The first version of
"wesnoth", 0.8.6, which introduced SDL and SDL_net usage and is several months
old, is the first that crashes when built for FC4.

Comment 1 Jakub Jelinek 2005-06-20 12:58:59 UTC

It is much more likely that this is an application bug (either wesnoth or
SDL*) than glibc bug (and I'd say it surely has nothing to do with gcc).
The above backtrace is what happens if some thread-specific data destructor
is invalid at thread exit time.
See pthread_key_create(3p) and pthread_key_delete(3p).
Most likely a shared library registered a destructor with pthread_key_create,
forgot to delete that key (pthread_key_delete) in its destructor code and was
unloaded (dlclose) by the application.
Can you please debug whether this is the case?
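
The mechanism described in this comment can be illustrated with a small sketch.
It is not code from wesnoth, SDL, or libstdc++; the names demo_key,
demo_destructor, worker, and count_destructor_calls are hypothetical. It relies
only on documented POSIX behaviour: a key's destructor runs at thread exit for
every non-NULL value, and is no longer called once pthread_key_delete has been
invoked for that key.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical key standing in for one registered by a library. */
static pthread_key_t demo_key;
static int destructor_calls;

/* Runs at thread exit for each thread that stored a non-NULL value. */
static void demo_destructor(void *value)
{
    assert(value != NULL);  /* POSIX only calls destructors for non-NULL values */
    destructor_calls++;
}

static void *worker(void *arg)
{
    static int dummy;
    int delete_key_first = *(int *)arg;

    /* Storing a non-NULL value makes the destructor eligible to run. */
    pthread_setspecific(demo_key, &dummy);

    /* Deleting the key unregisters the destructor before the thread exits. */
    if (delete_key_first)
        pthread_key_delete(demo_key);
    return NULL;
}

/* Returns how often the destructor ran for one worker thread, or -1 on error. */
int count_destructor_calls(int delete_key_first)
{
    pthread_t th;

    destructor_calls = 0;
    if (pthread_key_create(&demo_key, demo_destructor) != 0)
        return -1;
    if (pthread_create(&th, NULL, worker, &delete_key_first) != 0) {
        pthread_key_delete(demo_key);
        return -1;
    }
    pthread_join(th, NULL);
    if (!delete_key_first)
        pthread_key_delete(demo_key);
    return destructor_calls;
}
```

count_destructor_calls(0) should report one call and count_destructor_calls(1)
none. If the code behind the destructor is unloaded while the key is still
registered, the first case turns into a jump to an unmapped address at thread
exit, which is consistent with the backtrace in the description.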

Comment 2 Michael Schwendt 2005-06-20 16:04:00 UTC

SDL doesn't use pthread_key_{create,delete} at all. Wesnoth doesn't dlclose
anything either. And Wesnoth only uses SDL's Thread API, where the user provides
a thread function pointer and a pointer to user data. Downgrading SDL to a
rebuild of FC3's SDL version doesn't change anything.

The crash happens when a thread function returns, while the main program is
waiting in SDL_WaitThread.

When debugging inside ddd I can confirm your theory at least. Inside
__nptl_deallocate_tsd, when I step through the loop, the fourth data pointer is
non-NULL. The called destructor apparently points into nirvana (it once stepped
into a completely unrelated library, /usr/lib/libaudiofile.so.0). Now I need to
find out where the destructor pointer comes from...

Comment 3 Michael Schwendt 2005-06-20 21:46:33 UTC

Hmm, calls of pthread_key_create return into libstdc++ while stepping through
the code. Calls of pthread_key_delete are missing...

Breakpoint 2, __pthread_key_create (key=0x45d400, destr=0x8000) at
pthread_key_create.c:35
(gdb) bt
#0  __pthread_key_create (key=0x45d400, destr=0x8000) at pthread_key_create.c:35
#1  0x003bc460 in __gnu_cxx::__pool<true>::_M_initialize () from
/usr/lib/libstdc++.so.6
#2  0x003bd691 in __gnu_cxx::__pool<true>::_M_reclaim_block () from
/usr/lib/libstdc++.so.6
#3  0x0026e92b in ?? () from /lib/libpthread.so.0
#4  0x003bd001 in __gnu_cxx::__pool<true>::_M_reclaim_block () from
/usr/lib/libstdc++.so.6
#5  0x0040727d in std::string::_Rep::_S_create () from /usr/lib/libstdc++.so.6
#6  0x0040b25f in std::operator+<char, std::char_traits<char>,
std::allocator<char> > () from /usr/lib/libstdc++.so.6
#7  0x0040be7c in std::basic_string<char, std::char_traits<char>,
std::allocator<char> >::basic_string () from /usr/lib/libstdc++.so.6
#8  0x083998d5 in __static_initialization_and_destruction_0
(__initialize_p=Variable "__initialize_p" is not available.
) at widgets/slider.cpp:24
#9  0x083a25aa in __do_global_ctors_aux ()
#10 0x0804cf3d in _init ()
#11 0x083a2556 in __libc_csu_init ()
#12 0x00bc3d83 in __libc_start_main (main=0x8166626 <main>, argc=1,
ubp_av=0xbf8d3a54, init=0x83a253c <__libc_csu_init>, fini=0x83a258c
<__libc_csu_fini>, rtld_fini=0xb9ef2d <_dl_fini>, stack_end=0xbf8d3a4c) at
../sysdeps/generic/libc-start.c:183
#13 0x0804e271 in _start ()

Comment 4 Jakub Jelinek 2005-06-20 22:21:45 UTC

Thanks.  Then it really is a bug in the libstdc++ mt allocator.

Comment 5 Benjamin Kosnik 2005-06-22 14:50:11 UTC

Can you please attach whatever code you are using to reproduce?

Comment 6 Jakub Jelinek 2005-06-22 15:30:42 UTC

cat > O.c <<EOF
#include <dlfcn.h>
#include <pthread.h>

void *
tf (void *arg)
{
  void *h = dlopen ("./libO.so", RTLD_LAZY);
  void (*fn) (void);
  if (!h) return 0;
  fn = dlsym (h, "foo");
  fn ();
  dlclose (h);
  return 0;
}

int
main (void)
{
  pthread_t th;
  pthread_create (&th, NULL, tf, NULL);
  pthread_join (th, NULL);
  return 0;
}
EOF
cat > libO.C <<EOF
#include <string>

extern "C" void
foo (void)
{
  std::string s;
  s += "hello";
}
EOF
g++ -g -O2 -shared -fpic -o libO.so libO.C
gcc -g -O2 -o O O.c -ldl -lpthread

(This applies when mt is the default allocator; I'm not sure what the testcase
would look like when e.g. new is the default allocator but you want to forcibly
use the mt allocator.)
pthread_key_delete needs to be called somewhere in libstdc++.so's destructors
(but late enough that __pool<true>::_M_get_thread_id() is no longer called).
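
One way to arrange the cleanup asked for here is to pair key creation and
deletion with the library's load and unload hooks. The sketch below is a generic
illustration, assuming GCC's constructor and destructor function attributes. It
is not the libstdc++ patch, and pool_key, pool_init, pool_fini,
pool_thread_cleanup, and pool_key_is_live are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-library key, created on load and deleted on unload. */
static pthread_key_t pool_key;
static int pool_key_valid;

/* Frees a thread's private record when that thread exits. */
static void pool_thread_cleanup(void *record)
{
    free(record);
}

/* Runs when the object is loaded (program start or dlopen). */
__attribute__((constructor))
static void pool_init(void)
{
    pool_key_valid = (pthread_key_create(&pool_key, pool_thread_cleanup) == 0);
}

/* Runs when the object is unloaded (program exit or dlclose). Deleting the
   key here stops libpthread from calling pool_thread_cleanup after its code
   has been unmapped. */
__attribute__((destructor))
static void pool_fini(void)
{
    if (pool_key_valid) {
        int rc = pthread_key_delete(pool_key);
        assert(rc == 0);  /* deleting a key we created should not fail */
        (void)rc;
        pool_key_valid = 0;
    }
}

/* Reports whether the key is currently registered. */
int pool_key_is_live(void)
{
    return pool_key_valid;
}
```

The sketch does not address the ordering concern raised above: a real
implementation would still have to make sure nothing uses the key after
pool_fini has run.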

Comment 7 Jakub Jelinek 2005-07-07 17:59:18 UTC

http://gcc.gnu.org/ml/gcc-patches/2005-07/msg00478.html

Comment 8 Michael Schwendt 2005-07-10 18:15:11 UTC

Saw the libstdc++ rpm changelog that 4.0.1-1 is supposed to fix this, but it
still crashes in the same way.

The segfault in Blender (bug 157922) seems to occur in the same part of
libstdc++.

Comment 9 Jakub Jelinek 2005-07-10 20:40:18 UTC

This patch is supposed to fix crashes that are seen when dlclosing
libstdc++.so.6.
The backtrace that is given in #157922 is different, has nothing to do with
this bug.

Comment 10 Michael Schwendt 2005-07-10 20:54:37 UTC

Sorry. You misunderstood me. libstdc++-4.0.1-1 does not fix this.

Comment 11 Jakub Jelinek 2005-07-11 15:14:43 UTC

The testcase I added in #6 certainly works just fine with libstdc++-4.0.1-[12],
but segfaults with older libstdc++ rpms.
So, what backtrace do you get on the crash, and if it is from
__nptl_deallocate_tsd, what routine used to contain the $pc address?

Comment 12 Michael Schwendt 2005-07-11 21:42:11 UTC

This is what I see:

(gdb) print destr
$11 = (void (*)(void *)) 0xb61d48
<__gnu_cxx::__common_pool_policy<__gnu_cxx::__pool,
true>::_S_destroy_thread_key(void*)>

(gdb) stepi
0x00b61d48 in ?? ()

(gdb) bt
#0  0x00b61d48 in ?? ()
#1  0x0092cb7a in __nptl_deallocate_tsd () at pthread_create.c:153
#2  0x0092db97 in start_thread (arg=0x43acbb0) at pthread_create.c:268
#3  0x0079d34e in ?? () from /lib/libc.so.6

pthread_create.c:153
                            /* Call the user-provided destructor.  */
                            __pthread_keys[idx].destr (data);

Comment 13 Michael Schwendt 2005-07-12 08:01:32 UTC

As an additional observation, like in comment 3, breakpoint on
pthread_key_delete doesn't catch anything.

Comment 14 Jakub Jelinek 2005-07-12 08:10:08 UTC

Can you attach LD_DEBUG=all dump of the program?
For me pthread_key_delete is certainly called:
     18660:     calling fini: /usr/lib/libstdc++.so.6 [0]
     18660:
     18660:     symbol=__cxa_finalize;  lookup in file=./O [0]
     18660:     symbol=__cxa_finalize;  lookup in file=/lib/libdl.so.2 [0]
     18660:     symbol=__cxa_finalize;  lookup in file=/lib/libpthread.so.0 [0]
     18660:     symbol=__cxa_finalize;  lookup in file=/lib/libc.so.6 [0]
     18660:     binding file /usr/lib/libstdc++.so.6 [0] to /lib/libc.so.6 [0]: normal symbol `__cxa_finalize' [GLIBC_2.1.3]
     18660:     symbol=pthread_key_delete;  lookup in file=./O [0]
     18660:     symbol=pthread_key_delete;  lookup in file=/lib/libdl.so.2 [0]
     18660:     symbol=pthread_key_delete;  lookup in file=/lib/libpthread.so.0 [0]
     18660:     binding file /usr/lib/libstdc++.so.6 [0] to /lib/libpthread.so.0 [0]: normal symbol `pthread_key_delete'

from within libstdc++.so.6's destructors.

Comment 15 Michael Schwendt 2005-07-12 08:28:44 UTC

http://home.arcor.de/ms2002sep/tmp/wesnoth-LD_DEBUG.txt.bz2
(730 K compressed, 34 M uncompressed)

[...]

A breakpoint here catches pthread_key_delete only after the segfault has happened.

[New Thread 34855856 (LWP 847)]
[New Thread 106400688 (LWP 848)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 106400688 (zombie)]
0x00543d48 in ?? ()
(gdb) c
Fatal signal: Segmentation Fault (SDL Parachute Deployed)
[Thread 24366000 (LWP 846) exited]

Breakpoint 3, pthread_key_delete (key=4) at pthread_key_delete.c:31
(gdb)

Comment 16 Jakub Jelinek 2005-07-12 09:26:03 UTC

And libstdc++.so got unmapped before the segfault?  That would make no sense,
as pthread_key_delete then certainly couldn't be called afterwards.
Perhaps what you are seeing is not libstdc++.so being unmapped before the crash,
but simply some instruction in _S_destroy_thread_key causing the segfault.
That could have two reasons IMHO: the far more likely is a bug in wesnoth's
(or its libraries) memory management, where it would overflow some buffer into
mt allocator's control structures (similarly how you crash in an unrelated malloc
call if you overflow some heap buffer), the other possibility could be mt
allocator bug.
Running the program under strace could reveal whether libstdc++.so is mapped
at that point or not.

Comment 17 Michael Schwendt 2005-07-12 10:01:44 UTC

Created attachment 116641
wesnoth-strace.txt.gz

Normal strace? Or any special options?

[...]

I don't think I would get "Cannot access memory at address 0x492d48" in gdb for
just corrupted memory. Whatever user-provided destructor it tries to call, it's
not there anymore.

[...]

Wesnoth is in Fedora Extras Development (both CVS and repository, also for
debuginfo). You could have a look yourself.

Memory corruption would imply that the bug has been in there for almost a year
(since v0.8.6, which was the first to implement threaded networking) and has not
been noticed or discovered on distributions older than FC4/gcc4.

Comment 18 Jakub Jelinek 2005-07-12 10:07:54 UTC

strace -f at least, though perhaps
strace -E LD_DEBUG=all -f
would be even more helpful.

I'm aware that wesnoth is in extras, but unfortunately I couldn't reproduce it
remotely on my FC4 workstation (apparently requires running from local console),
so it would need to wait a few days.

Comment 19 Michael Schwendt 2005-07-12 10:48:38 UTC

Created attachment 116645
wesnoth-strace-f.txt.gz (71 K)

http://home.arcor.de/ms2002sep/tmp/wesnoth-LD_DEBUG_strace-f.txt.bz2
(1.5 M compressed, >100 M uncompressed)

Comment 20 Jakub Jelinek 2005-07-12 11:11:18 UTC

The wesnoth-LD_DEBUG_strace-f.txt dump tells me that at the point of the segfault
libstdc++.so.6 is still mapped and the thread with tid 1557, in which the
segfault happens, is currently being cancelled.  You should be able to verify
easily in the debugger that libstdc++.so.6 is still mapped, just wait in gdb
till it segfaults and print /proc/<pid>/maps from another terminal.
So the question really is on which exact instruction in the routine it segfaults,
what were the registers etc.
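
The same check can also be made from inside a process. The following is a rough
sketch, assuming Linux and the usual /proc/<pid>/maps line layout (start-end,
permissions, offset, device, inode, optional path). find_mapping is a
hypothetical helper, not part of glibc or gdb.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Looks up addr in /proc/self/maps.
   Returns 1 if some mapping contains addr (its path, possibly empty, is
   copied to path_out), 0 if no mapping contains it, -1 if the maps file
   cannot be opened. */
int find_mapping(const void *addr, char *path_out, size_t path_len)
{
    char line[4096 + 256];
    char path[4096];
    uintmax_t target = (uintmax_t)(uintptr_t)addr;
    int found = 0;
    FILE *maps;

    assert(path_out != NULL && path_len > 0);
    path_out[0] = '\0';

    maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    while (!found && fgets(line, sizeof line, maps) != NULL) {
        uintmax_t start, end;
        int fields;

        path[0] = '\0';
        /* Skip permissions, offset, device and inode; the path is optional. */
        fields = sscanf(line, "%jx-%jx %*s %*s %*s %*s %4095[^\n]",
                        &start, &end, path);
        if (fields < 2)
            continue;  /* not a line in the expected format */

        /* Mapping ranges are half-open: [start, end). */
        if (target >= start && target < end) {
            snprintf(path_out, path_len, "%s", path);
            found = 1;
        }
    }
    fclose(maps);
    return found;
}
```

Calling this with a destructor address recorded at pthread_key_create time would
show whether the address still belongs to a mapped object at the time of the
crash, and to which file.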

Comment 21 Michael Schwendt 2005-07-12 12:03:41 UTC

http://savannah.nongnu.org/bugs/?func=detailitem&item_id=13404

Well, it doesn't crash inside the thread function. The thread function
terminates with return 0, and then via pthread_exit it enters
__nptl_deallocate_tsd () where the aforementioned crash happens. Valgrind says
the called address "is not stack'd, malloc'd or (recently) free'd".

Comment 22 Jakub Jelinek 2005-07-12 12:10:06 UTC

Of course the address is not supposed to be malloc'd, free'd or on the stack.
The question is if that address corresponds to the text segment of libstdc++.so.6
and if that part of libstdc++.so.6's text segment contains the code it is
supposed to contain.

Comment 23 Michael Schwendt 2005-07-12 16:01:01 UTC

Did the following: Create a breakpoint in pthread_key_create and collect all
destructor addresses. The last one before I make the program crash is this:

Breakpoint 5, __pthread_key_create (key=0x937144, destr=0x11cd48
<__gnu_cxx::__common_pool_policy<__gnu_cxx::__pool,
true>::_S_destroy_thread_key(void*)>) at pthread_key_create.c:49

A look at /proc/.../maps does not show any memory area for that address.
Libstdc++ was here:

033fe000-034de000 r-xp 00000000 03:08 909190     /usr/lib/libstdc++.so.6.0.5
034de000-034e3000 rwxp 000e0000 03:08 909190     /usr/lib/libstdc++.so.6.0.5

Then:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 46013360 (zombie)]
0x0011cd48 in ?? ()

Dump of assembler code from 0x11cd48 to 0x11ce48:
    0x0011cd48:     Cannot access memory at address 0x11cd48

And again, /proc does not show any memory area that includes 0x11cd48:

00111000-00115000 r-xp 00000000 03:08 482527     /lib/libnss_dns-2.3.90.so
00115000-00116000 r-xp 00003000 03:08 482527     /lib/libnss_dns-2.3.90.so
00116000-00117000 rwxp 00004000 03:08 482527     /lib/libnss_dns-2.3.90.so
001b1000-001ba000 r-xp 00000000 03:08 482529     /lib/libnss_files-2.3.90.so
001ba000-001bb000 r-xp 00008000 03:08 482529     /lib/libnss_files-2.3.90.so
001bb000-001bc000 rwxp 00009000 03:08 482529     /lib/libnss_files-2.3.90.so

[...]

Repeated this and found destr to point into /usr/lib/libartscbackend.so.0.0.0
which was no longer mapped when it segfaulted.

[...]

Repeated this a third time and found destr to point into
/usr/lib/libartscbackend.so.0.0.0 again which was no longer mapped when it
segfaulted.

$ ldd /usr/lib/libartscbackend.so.0.0.0 | grep -e 'thread\|std'
        libgthread-2.0.so.0 => /usr/lib/libgthread-2.0.so.0 (0x0011a000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00854000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00f1f000)

Comment 24 Michael Schwendt 2005-07-12 19:13:08 UTC

Hmm... SDL_audio dlopens libartscbackend, this triggers the aforementioned call
of pthread_key_create, then SDL_audio dlcloses libartscbackend, but no
pthread_key_delete call is caught by the breakpoint.

Comment 25 Michael Schwendt 2005-07-14 13:42:06 UTC

Created attachment 116749
tiny testcase

As a status update, I've reached this point early this morning:

$ g++ -g -Wall -I/usr/include/SDL testcase_161061.cpp -o testcase_161061 -lSDL
$ ./testcase_161061 
Segmentation fault

Comment 26 Jakub Jelinek 2005-07-14 13:44:40 UTC

http://gcc.gnu.org/ml/gcc-patches/2005-07/msg00993.html
