Bug 218379
Summary: | Large binary causes infinite recursion when breaking into app | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Bill Helfinstine <bhelf> |
Component: | gdb | Assignee: | Alexandre Oliva <aoliva> |
Status: | CLOSED RAWHIDE | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 6 | CC: | aoliva, cagney, jan.kratochvil |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | gdb-6.5-21.fc7 | Doc Type: | Bug Fix |
Last Closed: | 2006-12-23 21:46:15 UTC | Type: | --- |
Bug Depends On: | 192964 | ||
Bug Blocks: |
Description
Bill Helfinstine
2006-12-04 22:14:47 UTC
Good catch. The anti-looping comment there is about the warning message; still, it would break the looping (just silently). Couldn't you use "fstack $PID" to print the backtrace and find out which function it is looping on? That would resolve pc=251700325328 and possibly make the problem reproducible locally. The gdb code could be blindly patched, but understanding the problem and having an appropriate testcase would be better. There are other methods, like checking "/proc/$PID/maps" to see which object the address 251700325328 belongs to, and due to prelink(1) it should be enough to run:

    $ gdb /lib64/libc.so.6
    (gdb) disass 251700325328

or so. Thanks.

With a trivial `poll(NULL,0,1000*1000)' call it really does not occur, and from the PC address 0x3a9a822bd0 (->0x*bd0) I failed to find a poll(2)-related function in FC6 glibc-2.5-3.x86_64. I would appreciate some core file (is it reproducible on it?). Any large (>=20MB) core files can also be uploaded to:

    $ ftp -n ftp.jankratochvil.net
    login: anonymous
    password: e-mail address
    cd /incoming
    bi
    put file
    quit

Hum. Another couple of data points: gdb 6.5 won't load the process due to the .gnu.hash section:

    GNU gdb 6.5
    Copyright (C) 2006 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB. Type "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu"...BFD: /usr/stow/JSAF/src/JSAF/jsaf: don't know how to handle OS specific section `.gnu.hash' [0x6ffffff6]
    "/usr/stow/JSAF/src/JSAF/jsaf": not in executable format: File format not recognized
    (gdb)

The gdb 6.5.90 snapshot doesn't have this problem:

    db659 jsaf
    GNU gdb 6.5.90
    Copyright (C) 2006 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB. Type "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu"...
    runUsing host libthread_db library "/lib64/libthread_db.so.1".
    (gdb) run
    ... snip program output ...
    Program received signal SIGINT, Interrupt.
    0x0000003a9a4c4b1f in poll () from /lib64/libc.so.6
    (gdb) bt
    #0 0x0000003a9a4c4b1f in poll () from /lib64/libc.so.6
    #1 0x00002aaaabaccbca in scheduler::poll_events (this=0x6955d30, run_time=10, allowed=1000)
        at scheduler.cpp:139
    #2 0x00002aaaabacce5d in scheduler::tick (this=0x6955d30, min_msecs=6000, max_msecs=6000)
        at scheduler.cpp:69
    #3 0x00002aaaaaf9042f in federation_manager::waitForFederationState (
        this=0x69a1700, executionName=0x2046dbe "standard") at fedmgr.cpp:356
    #4 0x00002aaaaaf907fd in federation_manager::createFederationExecution (
        this=0x69a1700, executionName=0x2046dbe "standard",
        fedFilename=0x7fff7b3dad60 "../../federations/standard/standard.fed") at fedmgr.cpp:134
    #5 0x00002aaaaaf76df4 in local_distribution_manager::create_federation_execution (this=0x6955ae0,
        executionName=0x2046dbe "standard",
        FED=0x7fff7b3dad60 "../../federations/standard/standard.fed") at ldm.cpp:644
    #6 0x00002aaaaaae7324 in rti13::RTIambassador::createFederationExecution (
        this=0x67e0360, executionName=0x2046dbe "standard",
        FED=0x7fff7b3dad60 "../../federations/standard/standard.fed") at cpp13_rtiamb.cpp:181
    #7 0x0000000001b452fc in Ambassadors::createAndJoinFederationExecution (
        this=0x69558e0, executionName=0x2046dbe "standard",
        fedFileName=0x7fff7b3dad60 "../../federations/standard/standard.fed",
        ridFileName=<value optimized out>,
        federateName=0x7fff7b3db080 "JSAF(Pocket)-OSCEOLA") at amb_init.cc:146
    #8 0x00000000018d5441 in ril_init (data_dir=0x1e13555 "../../data", reader_flags=1,
        federate_name=0x7fff7b3db080 "JSAF(Pocket)-OSCEOLA",
        fedex_to_join=0x2046dbe "standard", federations_dir=<value optimized out>,
        afi_federation=0x2046dbe "standard", som_filename=0x2046b29 "som.omt",
        fom_filename=0x50b763b "standard.omt", fed_filename=0x0, rid_filename=0x0,
        ddm_active=0, som_data_check=1, fom_data_check=1, time_regulating=0,
        time_constrained=0, time_managed=0, dc_best_match=1, site=15462, host=13877,
        initial_entity_id=1, cleanup_on_resign=1, catch_fatal_signals=1) at ril_init.cc:249
    #9 0x00000000018508ab in safrilinit_init () at sshc2_class.cc:311
    #10 0x000000000041ac51 in main (argc=1, argv=0x7fff7b3dd788) at main.c:1691
    (gdb)

Okay, here's some more information. I've got two binaries, both of which do the same startup processing, one of which has a lot more code than the other. If I run both processes under gdb and hit ctrl-c at the same point in both, when it's pausing to collect data off the network, the smaller binary does this:

    Program received signal SIGINT, Interrupt.
    0x0000003a9a4c4b1f in *__GI___poll (fds=0x262b960, nfds=1, timeout=10)
        at /usr/src/debug/gcc-4.1.1-20061011/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/ostream:498
    498 endl(basic_ostream<_CharT, _Traits>& __os)
    (gdb) bt
    #0 0x0000003a9a4c4b1f in *__GI___poll (fds=0x262b960, nfds=1, timeout=10)
        at /usr/src/debug/gcc-4.1.1-20061011/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/ostream:498
    #1 0x00002aaaabac9bca in scheduler::poll_events (this=0x25febf0, run_time=10, allowed=1000)
        at scheduler.cpp:139
    #2 0x00002aaaabac9e5d in scheduler::tick (this=0x25febf0, min_msecs=6000, max_msecs=6000)
        at scheduler.cpp:69
    #3 0x00002aaaaaf8d42f in federation_manager::waitForFederationState (
        this=0x264a6a0, executionName=0x795a3e "standard") at fedmgr.cpp:356
    #4 0x00002aaaaaf8d7fd in federation_manager::createFederationExecution (
        this=0x264a6a0, executionName=0x795a3e "standard",
        fedFilename=0x7fff430e6c60 "../../federations/standard/standard.fed") at fedmgr.cpp:134
    #5 0x00002aaaaaf73df4 in local_distribution_manager::create_federation_execution (this=0x25fe980,
        executionName=0x795a3e "standard",
        FED=0x7fff430e6c60 "../../federations/standard/standard.fed") at ldm.cpp:644
    #6 0x00002aaaaaae7324 in rti13::RTIambassador::createFederationExecution (
        this=0x24f35d0, executionName=0x795a3e "standard",
        FED=0x7fff430e6c60 "../../federations/standard/standard.fed") at cpp13_rtiamb.cpp:181
    #7 0x000000000070007c in Ambassadors::createAndJoinFederationExecution (
        this=0x25fe560, executionName=0x795a3e "standard",
        fedFileName=0x7fff430e6c60 "../../federations/standard/standard.fed",
        ridFileName=<value optimized out>,
        federateName=0x7fff430e6f80 "Culture-OSCEOLA") at amb_init.cc:146
    #8 0x000000000056f491 in ril_init (data_dir=0x76f695 "../../data", reader_flags=1,
        federate_name=0x7fff430e6f80 "Culture-OSCEOLA",
        fedex_to_join=0x795a3e "standard", federations_dir=<value optimized out>,
        afi_federation=0x795a3e "standard", som_filename=0x7957a9 "som.omt",
        fom_filename=0xf2b91b "standard.omt", fed_filename=0x0, rid_filename=0x0,
        ddm_active=0, som_data_check=1, fom_data_check=1, time_regulating=0,
        time_constrained=0, time_managed=0, dc_best_match=1, site=15462, host=13770,
        initial_entity_id=1, cleanup_on_resign=1, catch_fatal_signals=1) at ril_init.cc:249
    #9 0x000000000054e73b in safrilinit_init () at cluf_init.cc:494
    #10 0x000000000040abd5 in main (argc=1, argv=0x7fff430e7488) at main.c:635
    (gdb)

The larger binary, at the same point in its startup, will make gdb go into its infinite loop when ctrl-c is hit (before it returns with a prompt). When I try to run fstack on the larger app at the same point, it aborts:

    fstack `pidof jsaf`
    Abort

I'll upload a coredump of gdb when it's in its infinite loop and the process being debugged is at the same point as the above stack trace (in the poll system call).

It would be more useful for me if you started the process as:

    (ulimit -c unlimited;./run-the-jsaf-process)

and at that point sent it:

    kill -SEGV `pidof jsaf`

It should then create a "core" file in its current directory (unless jsaf turns off its core file limit explicitly, which should not be done). If gdb-6.5.90 does not have the problem, you may also be able to use it to run:

    gcore -o core-file-name `pidof jsaf`

The file from:

    rpm --qf "%{name}-%{version}-%{release}.%{arch}\n" -qa|sort >rpm-qa

(or a limited version of it, particularly the glibc line) would also be nice. Thanks!

Okay, I've got a core file from sending the process a kill -SEGV (way smaller than the one fcore made). I've uploaded it to ftp.whirpon.com as jsaf_core, and also the rpm -qa output, as jsaf_rpm_qa. It's in the anonymous account. Loading that core file into gdb causes the recursion to happen immediately, so it does seem reproducible. The glibc that's installed is glibc-2.5-3.x86_64, btw.

As the core file is not much help without its "jsaf" binary with the debuginfo, I am providing some gdb versions for possible testing on your system. I did not find a way to produce a binary causing this lockup, but I can imagine such a binary may exist (a trampoline of a function FOO located some bytes before the function FOO text, in the same file but in a different section). It looks like a regression caused by the fix for Bug 192964. I would welcome three PASS/FAIL results (one result for each of `clean', `ppcupdate' and `full') of: http://www.jankratochvil.net/priv/bz218379/ Thanks.

Tried all three versions, all of which PASS. Then I noticed the 6.5-15 update, which also PASSes. So, one of the changes between 6.5-13 and 6.5-15 appears to have fixed the problem! Thank you for your help!

Thanks for the testing... Well... I spent some time on it and there may still be a problem; I believe it just got hidden by the 6.5-15 update. Could you please also test the new pre* releases there? http://www.jankratochvil.net/priv/bz218379/ Otherwise at least it got fixed, thanks for the report.

Ah, okay. It does seem likely that the fix for glibc debuginfo hid the bug in this case. Anyway, I tried the three pre* releases, and got the following outcomes:

    6.5-19_preclean.fc6rh: PASS
    6.5-19_preppcupdate.fc6rh: FAIL, same infinite loop as 6.5-13
    6.5-19_prefull.fc6rh: PASS, with odd error messages:

    Program received signal SIGINT, Interrupt.
    warning: In stub for tanh (0x0000003a9a822bd0); interlocked, please submit the binary to http://bugzilla.redhat.com
    warning: In stub for tanh (0x0000003a9a822bd0); interlocked, please submit the binary to http://bugzilla.redhat.com
    0x0000003a9a4c4b1f in *__GI___poll (fds=0x698bce0, nfds=1, timeout=3)
        at ../sysdeps/ieee754/dbl-64/s_tanh.c:56
    56 {
    warning: In stub for tanh (0x0000003a9a822bd0); interlocked, please submit the binary to http://bugzilla.redhat.com
    (gdb) bt
    warning: In stub for tanh (0x0000003a9a822bd0); interlocked, please submit the binary to http://bugzilla.redhat.com
    #0 0x0000003a9a4c4b1f in *__GI___poll (fds=0x698bce0, nfds=1, timeout=3)
        at ../sysdeps/ieee754/dbl-64/s_tanh.c:56
    #1 0x00002aaaabaccbca in scheduler::poll_events (this=0x695ed30, run_time=3, allowed=1000)
        at scheduler.cpp:139
    #2 0x00002aaaabacce5d in scheduler::tick (this=0x695ed30, min_msecs=6000, max_msecs=6000)
        at scheduler.cpp:69
    ...

The stack trace is correct, but it got the filename confused in the first frame, which I'm assuming (?) causes the error about the interlocked stub.

As I see it, the `preppcupdate' variant did not fix it, so the workaround was needed. Unfortunately it is a pain to debug this remotely, so I am keeping the Bugzilla submit request there and hope to get a locally reproducible bug report later. The message should never appear in cases that would not result in the deadlock anyway. Committed to RawHide as:

    * Sat Dec 23 2006 Jan Kratochvil <jan.kratochvil> - 6.5-21
    - Fix lockup on trampoline vs. its function lookup; unreproducible (BZ 218379).

Thanks for the extensive testing and for the effective workaround found this way. Regards and glad to CLOSE it now.

And no FC6 update apparently needed.
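For readers who want to repeat the "/proc/$PID/maps" lookup suggested early in this report, here is a minimal sketch in Python. It assumes the usual Linux maps line layout (address range, permissions, offset, device, inode, optional path). The names `parse_maps`, `find_mapping` and `SAMPLE_MAPS` are made up for illustration, and the sample mappings and library paths are invented; they are not taken from the reporter's system. Note that the decimal pc 251700325328 quoted above is the same address as 0x3a9a822bd0.

```python
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int   # first address of the mapping
    end: int     # one past the last address
    perms: str   # e.g. "r-xp"
    path: str    # backing file, or "" for anonymous mappings


def parse_maps(text: str) -> List[Mapping]:
    """Parse the text of /proc/<pid>/maps into Mapping records."""
    mappings = []
    for line in text.splitlines():
        # Fields: range, perms, offset, dev, inode, [path]
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_hex, end_hex = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(
            Mapping(int(start_hex, 16), int(end_hex, 16), fields[1], path)
        )
    return mappings


def find_mapping(mappings: List[Mapping], address: int) -> Optional[Mapping]:
    """Return the mapping that contains address, or None."""
    for m in mappings:
        if m.start <= address < m.end:
            return m
    return None


# Hypothetical sample data; a real run would instead read
# open(f"/proc/{pid}/maps").read() for the process being debugged.
SAMPLE_MAPS = """\
00400000-00452000 r-xp 00000000 fd:00 1001 /usr/bin/example
3a9a400000-3a9a544000 r-xp 00000000 fd:00 2001 /lib64/libexample-a.so
3a9a800000-3a9a882000 r-xp 00000000 fd:00 2002 /lib64/libexample-b.so
7fff7b3c8000-7fff7b3de000 rw-p 00000000 00:00 0 [stack]
"""

if __name__ == "__main__":
    pc = 251700325328  # decimal pc from the report, i.e. 0x3a9a822bd0
    hit = find_mapping(parse_maps(SAMPLE_MAPS), pc)
    print(hex(pc), "->", hit.path if hit else "not mapped")
```

Once the owning object is known, it can be loaded into gdb on its own and disassembled at that address, as suggested in the thread; with prelink(1) the load address in the process should match the one in the file.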