Bug 218379

Summary:	Large binary causes infinite recursion when breaking into app
Product:	[Fedora] Fedora	Reporter:	Bill Helfinstine <bhelf>
Component:	gdb	Assignee:	Alexandre Oliva <aoliva>
Status:	CLOSED RAWHIDE	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	6	CC:	aoliva, cagney, jan.kratochvil
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	gdb-6.5-21.fc7	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-12-23 21:46:15 UTC	Type:	---
Bug Depends On:	192964
Bug Blocks:

Description Bill Helfinstine 2006-12-04 22:14:47 UTC

Description of problem:

With one build of our application, hitting ctrl-c to break into gdb will cause
gdb to go into an infinite loop trying to look up the line where the SIGINT was
delivered.  (It's in poll).  

A stripped-down version of the app doesn't cause this problem to occur, nor does
it happen on 32-bit i386 builds.  My unsupported guess was that it had something
to do with the size of the application being debugged.

The stack trace of gdb when it does this looks like:

(gdb) bt
#0  0x00000000004450a4 in lookup_minimal_symbol_by_pc_section (
    pc=251700325328, section=0x570f500) at gdb/minsyms.c:484
#1  0x00000000004bbfb2 in find_pc_sect_line (pc=251700325328, 
    section=0x570f500, notcurrent=0) at gdb/symtab.c:2057
#2  0x00000000004bc480 in find_pc_line (pc=251700325328, notcurrent=0)
    at gdb/symtab.c:2232
#3  0x00000000004bc1ff in find_pc_sect_line (pc=251700325328, 
    section=0x570f500, notcurrent=0) at gdb/symtab.c:2081

...   (lots and lots of the same two functions with the same parameters)

#1070 0x00000000004bc480 in find_pc_line (pc=251700325328, notcurrent=0)
    at gdb/symtab.c:2232
#1071 0x00000000004bc1ff in find_pc_sect_line (pc=251700325328, 
    section=0x570f500, notcurrent=0) at gdb/symtab.c:2081
#1072 0x00000000004bc480 in find_pc_line (pc=251700325328, notcurrent=0)
    at gdb/symtab.c:2232
#1073 0x00000000004bc1ff in find_pc_sect_line (pc=251700325328, 
    section=0x570f500, notcurrent=0) at gdb/symtab.c:2081
#1074 0x00000000004bc480 in find_pc_line (pc=251700325328, notcurrent=0)
    at gdb/symtab.c:2232
#1075 0x00000000004bc1ff in find_pc_sect_line (pc=251696794399, 
    section=0x59b0df8, notcurrent=0) at gdb/symtab.c:2081
#1076 0x00000000004bc480 in find_pc_line (pc=251696794399, notcurrent=0)
    at gdb/symtab.c:2232
#1077 0x000000000055550e in find_frame_sal (frame=0xb3f3e0, sal=0x7fff1d1a8200)
    at gdb/frame.c:1392
#1078 0x00000000004d86fd in set_current_sal_from_frame (frame=0x1648, center=1)
    at gdb/stack.c:379
#1079 0x00000000004cf137 in normal_stop () at gdb/infrun.c:3147
#1080 0x00000000004d2b2c in proceed (addr=<value optimized out>, 
    siggnal=TARGET_SIGNAL_0, step=0) at gdb/infrun.c:827
#1081 0x00000000004cdbfe in run_command_1 (args=0xafb104 "-federation ur2015", 
    from_tty=1, tbreak_at_main=<value optimized out>) at gdb/infcmd.c:552
#1082 0x0000000000447c6d in execute_command (p=0xafb115 "5", from_tty=1)
    at gdb/top.c:452
#1083 0x00000000004de0fb in command_handler (command=0xafb100 "")
    at gdb/event-top.c:512
#1084 0x00000000004dedc2 in command_line_handler (rl=<value optimized out>)
    at gdb/event-top.c:797
#1085 0x0000000000583eea in rl_callback_read_char () at readline/callback.c:204
#1086 0x00000000004de2c9 in rl_callback_read_char_wrapper (client_data=0x1648)
    at gdb/event-top.c:178
#1087 0x00000000004dcdaf in process_event () at gdb/event-loop.c:343
#1088 0x00000000004dd618 in gdb_do_one_event (data=<value optimized out>)
    at gdb/event-loop.c:380
#1089 0x00000000004da49b in catch_errors (func=0x4dd510 <gdb_do_one_event>, 
    func_args=0x0, errstring=0x6043c6 "", mask=<value optimized out>)
    at gdb/exceptions.c:515
#1090 0x0000000000486e0a in tui_command_loop (data=<value optimized out>)
    at gdb/tui/tui-interp.c:151
#1091 0x0000000000441329 in captured_command_loop (data=0x1648)
    at gdb/main.c:101
#1092 0x00000000004da49b in catch_errors (
    func=0x441320 <captured_command_loop>, func_args=0x0, 
    errstring=0x6043c6 "", mask=<value optimized out>) at gdb/exceptions.c:515
#1093 0x0000000000441a06 in captured_main (data=<value optimized out>)
    at gdb/main.c:835
#1094 0x00000000004da49b in catch_errors (func=0x441360 <captured_main>, 
    func_args=0x7fff1d1a8800, errstring=0x6043c6 "", 
    mask=<value optimized out>) at gdb/exceptions.c:515
#1095 0x0000000000441314 in gdb_main (args=0x13b8) at gdb/main.c:844
#1096 0x00000000004412e6 in main (argc=<value optimized out>, argv=0x13b8)
    at gdb/gdb.c:35


The app that fails is this big:

    text     data      bss       dec     hex filename
35003310  4614208 61891264 101508782 60ce6ae jsaf

and the one that succeeds is this big:

    text     data      bss       dec     hex filename
 4918870  4223328 23250728  32392926 1ee46de culture
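
As a sanity check on the two size(1) listings above: the dec column should equal text + data + bss, and hex is the same total in base 16. A small Python sketch (the dictionary layout and the function name are only for illustration) confirms that and shows how much larger the failing binary's text segment is:

```python
# Figures copied from the size(1) output in this report.
SIZES = {
    "jsaf":    {"text": 35003310, "data": 4614208, "bss": 61891264},
    "culture": {"text": 4918870,  "data": 4223328, "bss": 23250728},
}

def total(name):
    """Sum of text, data and bss, i.e. the 'dec' column of size(1)."""
    s = SIZES[name]
    return s["text"] + s["data"] + s["bss"]

for name in SIZES:
    # Print the total in decimal and in hex, as size(1) does.
    print(f"{name}: dec={total(name)} hex={total(name):x}")

# How many times larger the text segment of the failing binary is.
ratio = SIZES["jsaf"]["text"] / SIZES["culture"]["text"]
print(f"text ratio jsaf/culture: {ratio:.1f}")
```

The totals match the dec and hex columns, and the jsaf text segment is roughly seven times the size of culture's. That by itself does not show that size is the cause.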


Unfortunately, it's a really big app, and I can't attach it here.  

I'm willing to try to debug it here, but I don't know what I'm looking for
really.  There's a commented-out line at gdb/symtab.c:2078 that has a comment
that seems to be indicating that it's trying to prevent infinite recursion.  I
don't know if that has any relevance, but it's certainly eye-catching in this
stack trace.



Version-Release number of selected component (if applicable):

gdb-6.5-13.fc6

How reproducible:

Happens every time.

Comment 1 Jan Kratochvil 2006-12-06 10:48:49 UTC

Good catch. The anti-looping comment there concerns the warning message; the
code there would still break the loop (just silently).

Could you run "fstack $PID" to print the backtrace and find out which function
it is looping on? Resolving that pc=251700325328 might make it possible to
reproduce the problem locally. The gdb code could be patched blindly, but
understanding the problem and having an appropriate testcase would be better.

There are other methods, such as checking in "/proc/$PID/maps" which object the
address 251700325328 belongs to; thanks to prelink(1) it should then be enough to:
$ gdb /lib64/libc.so.6
(gdb) disass 251700325328
or so.
Thanks.
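
The /proc/$PID/maps check suggested above can be sketched in Python. This is only an illustration: the function name is made up, and the sample mappings are invented (they are not from the reporter's system). The only real value is the decimal pc from the backtrace.

```python
def find_mapping(maps_text, addr):
    """Return (start, end, path) of the mapping that contains addr, else None."""
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a maps line
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= addr < end:
            path = fields[5].strip() if len(fields) > 5 else ""
            return start, end, path
    return None

# Invented sample in /proc/PID/maps format; NOT taken from the reporter's system.
SAMPLE_MAPS = """\
00400000-02600000 r-xp 00000000 fd:00 1001 /opt/example/app
3a9a400000-3a9a546000 r-xp 00000000 fd:00 2002 /lib64/libexample-a.so
3a9a800000-3a9a882000 r-xp 00000000 fd:00 2003 /lib64/libexample-b.so
"""

pc = 251700325328            # decimal pc from the gdb backtrace
print(hex(pc))               # gdb prints pc in decimal here; maps uses hex
print(find_mapping(SAMPLE_MAPS, pc))
```

On a live system the text would come from reading /proc/$PID/maps of the debugged process instead of the sample string.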

Comment 2 Jan Kratochvil 2006-12-09 14:10:36 UTC

It does not occur with a trivial `poll(NULL,0,1000*1000)' call, and I failed to
find any poll(2)-related function at the PC address 0x3a9a822bd0 (->0x*bd0) in
FC6 glibc-2.5-3.x86_64, so I would appreciate a core file (is it reproducible
with one?).

Large (>=20MB) core files can also be uploaded to:
$ ftp -n ftp.jankratochvil.net
login: anonymous
password: e-mail address
cd /incoming
bi
put file
quit

Comment 3 Bill Helfinstine 2006-12-12 22:22:08 UTC

Hum.

Another couple of data points:

gdb 6.5 won't load the process due to the .gnu.hash section:

GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...BFD:
/usr/stow/JSAF/src/JSAF/jsaf: don't know how to handle OS specific section
`.gnu.hash' [0x6ffffff6]
"/usr/stow/JSAF/src/JSAF/jsaf": not in executable format: File format not recognized

(gdb) 


The gdb 6.5.90 snapshot doesn't have this problem:

db659 jsaf
GNU gdb 6.5.90
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
runUsing host libthread_db library "/lib64/libthread_db.so.1".
(gdb) run


... snip program output ...

Program received signal SIGINT, Interrupt.
0x0000003a9a4c4b1f in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003a9a4c4b1f in poll () from /lib64/libc.so.6
#1  0x00002aaaabaccbca in scheduler::poll_events (this=0x6955d30, run_time=10,
    allowed=1000) at scheduler.cpp:139
#2  0x00002aaaabacce5d in scheduler::tick (this=0x6955d30, min_msecs=6000,
    max_msecs=6000) at scheduler.cpp:69
#3  0x00002aaaaaf9042f in federation_manager::waitForFederationState (
    this=0x69a1700, executionName=0x2046dbe "standard") at fedmgr.cpp:356
#4  0x00002aaaaaf907fd in federation_manager::createFederationExecution (
    this=0x69a1700, executionName=0x2046dbe "standard",
    fedFilename=0x7fff7b3dad60 "../../federations/standard/standard.fed")
    at fedmgr.cpp:134
#5  0x00002aaaaaf76df4 in local_distribution_manager::create_federation_execution (
    this=0x6955ae0, executionName=0x2046dbe "standard",
    FED=0x7fff7b3dad60 "../../federations/standard/standard.fed")
    at ldm.cpp:644
#6  0x00002aaaaaae7324 in rti13::RTIambassador::createFederationExecution (
    this=0x67e0360, executionName=0x2046dbe "standard",
    FED=0x7fff7b3dad60 "../../federations/standard/standard.fed")
    at cpp13_rtiamb.cpp:181
#7  0x0000000001b452fc in Ambassadors::createAndJoinFederationExecution (
    this=0x69558e0, executionName=0x2046dbe "standard",
    fedFileName=0x7fff7b3dad60 "../../federations/standard/standard.fed",
    ridFileName=<value optimized out>,
    federateName=0x7fff7b3db080 "JSAF(Pocket)-OSCEOLA") at amb_init.cc:146
#8  0x00000000018d5441 in ril_init (data_dir=0x1e13555 "../../data",
    reader_flags=1, federate_name=0x7fff7b3db080 "JSAF(Pocket)-OSCEOLA",
    fedex_to_join=0x2046dbe "standard", federations_dir=<value optimized out>,
    afi_federation=0x2046dbe "standard", som_filename=0x2046b29 "som.omt",
    fom_filename=0x50b763b "standard.omt", fed_filename=0x0, rid_filename=0x0,
    ddm_active=0, som_data_check=1, fom_data_check=1, time_regulating=0,
    time_constrained=0, time_managed=0, dc_best_match=1, site=15462,
    host=13877, initial_entity_id=1, cleanup_on_resign=1,
    catch_fatal_signals=1) at ril_init.cc:249
#9  0x00000000018508ab in safrilinit_init () at sshc2_class.cc:311
#10 0x000000000041ac51 in main (argc=1, argv=0x7fff7b3dd788) at main.c:1691
(gdb)

Okay, here's some more information.  I've got two binaries, both of which do the
same startup processing, one of which has a lot more code than the other.  If I
run both processes under gdb, and hit ctrl-c at the same point in both, when
it's pausing to collect data off the network, the smaller binary does this:



Program received signal SIGINT, Interrupt.
0x0000003a9a4c4b1f in *__GI___poll (fds=0x262b960, nfds=1, timeout=10)
    at /usr/src/debug/gcc-4.1.1-20061011/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/ostream:498
498         endl(basic_ostream<_CharT, _Traits>& __os)
(gdb) bt
#0  0x0000003a9a4c4b1f in *__GI___poll (fds=0x262b960, nfds=1, timeout=10)
    at /usr/src/debug/gcc-4.1.1-20061011/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/ostream:498
#1  0x00002aaaabac9bca in scheduler::poll_events (this=0x25febf0, run_time=10,
    allowed=1000) at scheduler.cpp:139
#2  0x00002aaaabac9e5d in scheduler::tick (this=0x25febf0, min_msecs=6000,
    max_msecs=6000) at scheduler.cpp:69
#3  0x00002aaaaaf8d42f in federation_manager::waitForFederationState (
    this=0x264a6a0, executionName=0x795a3e "standard") at fedmgr.cpp:356
#4  0x00002aaaaaf8d7fd in federation_manager::createFederationExecution (
    this=0x264a6a0, executionName=0x795a3e "standard",
    fedFilename=0x7fff430e6c60 "../../federations/standard/standard.fed")
    at fedmgr.cpp:134
#5  0x00002aaaaaf73df4 in local_distribution_manager::create_federation_execution (
    this=0x25fe980, executionName=0x795a3e "standard",
    FED=0x7fff430e6c60 "../../federations/standard/standard.fed")
    at ldm.cpp:644
#6  0x00002aaaaaae7324 in rti13::RTIambassador::createFederationExecution (
    this=0x24f35d0, executionName=0x795a3e "standard",
    FED=0x7fff430e6c60 "../../federations/standard/standard.fed")
    at cpp13_rtiamb.cpp:181
#7  0x000000000070007c in Ambassadors::createAndJoinFederationExecution (
    this=0x25fe560, executionName=0x795a3e "standard",
    fedFileName=0x7fff430e6c60 "../../federations/standard/standard.fed",
    ridFileName=<value optimized out>,
    federateName=0x7fff430e6f80 "Culture-OSCEOLA") at amb_init.cc:146
#8  0x000000000056f491 in ril_init (data_dir=0x76f695 "../../data",
    reader_flags=1, federate_name=0x7fff430e6f80 "Culture-OSCEOLA",
    fedex_to_join=0x795a3e "standard", federations_dir=<value optimized out>,
    afi_federation=0x795a3e "standard", som_filename=0x7957a9 "som.omt",
    fom_filename=0xf2b91b "standard.omt", fed_filename=0x0, rid_filename=0x0,
    ddm_active=0, som_data_check=1, fom_data_check=1, time_regulating=0,
    time_constrained=0, time_managed=0, dc_best_match=1, site=15462,
    host=13770, initial_entity_id=1, cleanup_on_resign=1,
    catch_fatal_signals=1) at ril_init.cc:249
#9  0x000000000054e73b in safrilinit_init () at cluf_init.cc:494
#10 0x000000000040abd5 in main (argc=1, argv=0x7fff430e7488) at main.c:635
(gdb)       



The larger binary, at the same point in its startup, will make gdb go into its
infinite loop when ctrl-c is hit (before it returns back with a prompt).

When I try to run fstack on the larger app at the same point, it aborts:

fstack `pidof jsaf`
Abort


I'll upload a coredump of gdb when it's in its infinite loop, and the process
being debugged is at the same point as the above stack trace (in the poll system
call).

Comment 4 Jan Kratochvil 2006-12-12 22:30:34 UTC

It would be more useful for me if you started the process as:
(ulimit -c unlimited;./run-the-jsaf-process)
and at that point sent it:
kill -SEGV `pidof jsaf`
It should then create a "core" file in its current directory (unless jsaf
explicitly turns off its core file limit, which it should not do).
If gdb-6.5.90 does not have the problem, you may also be able to use it to run:
gcore -o core-file-name `pidof jsaf`
The file from:
rpm --qf "%{name}-%{version}-%{release}.%{arch}\n" -qa|sort >rpm-qa
(or a limited version of it, particularly the glibc line) would also be nice.
Thanks!
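
For reference, the core file limit mentioned above is a per-process resource limit that a program can inspect or change for itself. A minimal Python sketch of what `ulimit -c unlimited` amounts to from inside a process, assuming a POSIX system (the function name is made up for illustration, and the soft limit can only be raised as far as the hard limit allows):

```python
import resource

def raise_core_limit():
    """Raise the soft RLIMIT_CORE to the hard limit and return (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)

soft, hard = raise_core_limit()
if soft == resource.RLIM_INFINITY:
    print("core limit: unlimited")
else:
    print("core limit:", soft, "bytes")
```

A program that instead sets this limit to 0 at startup would not leave a core file after `kill -SEGV`, which is the case the parenthetical above warns about.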

Comment 5 Bill Helfinstine 2006-12-13 00:29:15 UTC

Okay, I've got a core file from sending the process a kill -SEGV (way smaller
than the one fcore made).  I've uploaded it to ftp.whirpon.com as jsaf_core, and
also the rpm -qa output, as jsaf_rpm_qa.  It's in the anonymous account.

Loading that core file into gdb causes the recursion to happen immediately, so
it does seem reproducible.

The glibc that's installed is glibc-2.5-3.x86_64, btw.

Comment 6 Jan Kratochvil 2006-12-17 23:59:13 UTC

As the core file is not very helpful without its "jsaf" binary and debuginfo,
I am providing some gdb versions for testing on your system.
I did not find a way to produce a binary that causes this lockup, but I can
imagine such a binary existing (the trampoline of a function FOO located some
bytes before the text of function FOO, in the same file but in a different
section).
It looks like a regression caused by the fix for Bug 192964.
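
The suspected layout can be modelled in a few lines of Python. This is a deliberately simplified toy, not gdb's actual code: the symbol table, addresses, section names and function names are all hypothetical, and the section is simply passed through the recursion. It only illustrates how a by-pc lookup restricted to one section can keep returning the trampoline, and how a guard turns the endless recursion into an "interlocked" result.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MinSym:
    name: str
    addr: int
    section: str
    is_trampoline: bool

# Hypothetical layout: trampoline for "foo" a few bytes before foo's text,
# in the same file but in a different section.
SYMS = [
    MinSym("foo", 0x1000, ".plt", True),
    MinSym("foo", 0x1010, ".text", False),
]

def lookup_by_pc_section(pc, section):
    """Nearest symbol at or below pc, considering only the given section."""
    cands = [s for s in SYMS if s.addr <= pc and s.section == section]
    return max(cands, key=lambda s: s.addr, default=None)

def lookup_text(name):
    """The real (non-trampoline) function symbol with this name, if any."""
    return next((s for s in SYMS if s.name == name and not s.is_trampoline), None)

def find_line(pc, section, seen=None):
    """Resolve pc, following trampolines; 'seen' guards against looping."""
    seen = set() if seen is None else seen
    if pc in seen:
        return ("interlocked", pc)   # without this guard: unbounded recursion
    seen.add(pc)
    msym = lookup_by_pc_section(pc, section)
    if msym is not None and msym.is_trampoline:
        target = lookup_text(msym.name)
        # Comparing only the two symbol addresses does not help here:
        # they differ, yet the next lookup lands on the trampoline again.
        if target is not None and target.addr != msym.addr:
            return find_line(target.addr, section, seen)
    return ("resolved", pc)

print(find_line(0x1010, ".plt"))    # lookup keeps hitting the trampoline
print(find_line(0x1010, ".text"))   # lookup finds the real function
```

In this toy the first call would recurse forever without the `seen` check, always with the same pc and section, which is the shape of the repeated frames in the backtrace from the original description.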

I would welcome three PASS/FAIL results (one result for each of `clean',
`ppcupdate' and `full') of:
http://www.jankratochvil.net/priv/bz218379/

Thanks.

Comment 7 Bill Helfinstine 2006-12-18 21:44:09 UTC

Tried all three versions, all of which PASS.

Then I noticed the 6.5-15 update, which also PASSes.  

So, one of the changes between 6.5-13 and 6.5-15 appears to have fixed the problem!

Thank you for your help!

Comment 8 Jan Kratochvil 2006-12-18 23:26:47 UTC

Thanks for the testing...
Well... I spent some time on it and there may still be a problem; I believe it
just got hidden by the 6.5-15 update.
Could you please also test the new pre* releases there?
http://www.jankratochvil.net/priv/bz218379/
Otherwise at least it got fixed, thanks for the report.

Comment 9 Bill Helfinstine 2006-12-18 23:43:55 UTC

Ah, okay.  It does seem likely that the fix for glibc debuginfo hid the bug in
this case.

Anyway, I tried the three pre* releases, and got the following outcomes:

6.5-19_preclean.fc6rh: PASS
6.5-19_preppcupdate.fc6rh: FAIL, same infinite loop as 6.5-13
6.5-19_prefull.fc6rh: PASS, with odd error messages:


Program received signal SIGINT, Interrupt.
warning: In stub for tanh (0x0000003a9a822bd0); interlocked, please submit the
binary to http://bugzilla.redhat.com
warning: In stub for tanh (0x0000003a9a822bd0); interlocked, please submit the
binary to http://bugzilla.redhat.com
0x0000003a9a4c4b1f in *__GI___poll (fds=0x698bce0, nfds=1, timeout=3)
    at ../sysdeps/ieee754/dbl-64/s_tanh.c:56
56      {
warning: In stub for tanh (0x0000003a9a822bd0); interlocked, please submit the
binary to http://bugzilla.redhat.com
(gdb) bt
warning: In stub for tanh (0x0000003a9a822bd0); interlocked, please submit the
binary to http://bugzilla.redhat.com
#0  0x0000003a9a4c4b1f in *__GI___poll (fds=0x698bce0, nfds=1, timeout=3)
    at ../sysdeps/ieee754/dbl-64/s_tanh.c:56
#1  0x00002aaaabaccbca in scheduler::poll_events (this=0x695ed30, run_time=3,
    allowed=1000) at scheduler.cpp:139
#2  0x00002aaaabacce5d in scheduler::tick (this=0x695ed30, min_msecs=6000,
    max_msecs=6000) at scheduler.cpp:69
...


The stack trace is correct, but it got the filename confused in the first frame,
which I'm assuming (?) causes the error about the interlocked stub.

Comment 10 Jan Kratochvil 2006-12-23 21:46:15 UTC

As I see, the `preppcupdate' variant did not fix it, so the workaround was
needed. Unfortunately it is a pain to debug remotely, so I am keeping the
Bugzilla submit request in the warning and hope to get a locally reproducible
bug report later. The message should never appear in cases that would not
otherwise result in the deadlock anyway.

Committed to RawHide as:
* Sat Dec 23 2006 Jan Kratochvil <jan.kratochvil> - 6.5-21
- Fix lockup on trampoline vs. its function lookup; unreproducible (BZ 218379).

Thanks for the extensive testing and for the effective workaround found this
way.  Regards and glad to CLOSE it now.  And no FC6 update apparently needed.