Bug 1866884
Summary: | AArch64: sometimes, gdb fails to load symbols of a dynamic library with a pending breakpoint | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Victor Stinner <vstinner> |
Component: | gdb | Assignee: | Kevin Buettner <kevinb> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 33 | CC: | cstratak, jan.kratochvil, keiths, kevinb, pmuldoon, sergiodj |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Last Closed: | 2020-12-04 16:46:19 UTC | Type: | Bug |
Description
Victor Stinner
2020-08-06 17:23:48 UTC
Here is a variant which only uses gdb to reproduce the bug (it doesn't use python-gdb.py nor the Python test suite).

It seems like there are different cases:

* Sometimes, everything works
* Sometimes, Python is stopped at the breakpoint but "bt" fails to display symbols.
* Sometimes, Python does not stop at the breakpoint.

"break meth_fastcall": this function is defined in _testcapi.cpython-310d-aarch64-linux-gnu.so, a dynamic library which is loaded after Python is started, by the Python "import _testcapi" command.

Steps to Reproduce:
1. git clone https://github.com/python/cpython
2. cd cpython
3. ./configure --with-pydebug # Python uses GCC -Og by default
4. make
5. rm -f python-gdb.py
6. ./script2.sh

Using the three following files.

Note on (6): You can try to run ./script2.sh multiple times in parallel to make the bug more likely.

script2.sh (executable shell script):
---
OUT=$(mktemp)
while true; do
    gdb --batch -x cmds2 -args ./python -S x.py < /dev/null 2>&1 | tee $OUT
    grep -q -F 'Breakpoint 1, meth_fastcall' $OUT || break
    echo
    echo "===================================================="
    echo
done
echo "#######################################################"
echo "BUG!"
echo "#######################################################"
cat $OUT
echo "#######################################################"
rm -f $OUT
---

x.py:
---
import _testcapi; _testcapi.meth_fastcall()
---

cmds2:
---
set breakpoint pending on
set print address off
break meth_fastcall
run
set print entry-values no
bt
---

I failed to reproduce the issue on Fedora 32 AArch64 with gdb-9.1-5.fc32.aarch64. It seems to be a regression introduced between gdb-9.1-5.fc32.aarch64 and gdb-9.2-2.fc33.aarch64.

I tried methods of Comment 0 and Comment 1:

* Comment 0 method: I ran "./python -m test -v test_gdb -m test_pycfunction -F -j8" for 8 minutes, no failure. It runs the test 8x in parallel.
* Comment 1 method: I ran the script 6x in parallel for 5 min, while I stressed the machine to increase the system load (around 23.48 on a machine with 8 CPUs).

By the way, copy/paste is boring, so I wrote a single script for the Comment 1 method.

reproducer.sh:
---
SCRIPT=script.py
CMDS=cmds
OUT=$(mktemp)

if [ ! -e $SCRIPT ]; then
    echo 'import _testcapi; _testcapi.meth_fastcall()' >$SCRIPT
fi

if [ ! -e $CMDS ]; then
    cat >$CMDS <<EOF
set breakpoint pending on
set print address off
break meth_fastcall
run
set print entry-values no
bt
EOF
fi

rm -f python-gdb.py

while true; do
    gdb --batch -x $CMDS -args ./python -S $SCRIPT < /dev/null 2>&1 | tee $OUT
    grep -q -F 'Breakpoint 1, meth_fastcall' $OUT || break
    echo
    echo "===================================================="
    echo
done
echo "#######################################################"
echo "BUG!"
echo "#######################################################"
cat $OUT
echo "#######################################################"
rm -f $OUT
---

A data point...

I grabbed a beaker machine, ran *ten* parallel looping builds of gdb on it (to load close to 50 peak), and then ran five instances of the reproducer script you provided (THANK YOU SO MUCH!).

Unfortunately, after 60,000+ iterations of the script over the weekend, never once did the bug appear.

Since the differences between 9.1 and 9.2 are so minor, I can't help but wonder if this isn't caused or exacerbated by something in the compose, such as glibc or kernel. I also wonder if this was fixed in some recent/pending update. I tested on the Fedora-Rawhide-20200805.n.0 Server compose.

The next steps (beyond updating the server, which is, as I read it, not possible) are to try a 9.2 build in your system or even FSF master. Let me know if you would like me to provide you such a build.

This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle. Changing version to 33.
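For readers who prefer to drive the loop from Python instead of opening several terminals, the logic of reproducer.sh can be sketched roughly as follows. This is only an illustration, not something posted in this bug: the function names (`hit_breakpoint`, `run_gdb_once`, `loop_until_failure`, `run_parallel`) are made up for the sketch, and the gdb command line simply mirrors the one in reproducer.sh, assuming its default file names `cmds` and `script.py` already exist in the CPython build directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Same fixed string that reproducer.sh looks for with "grep -q -F".
MARKER = "Breakpoint 1, meth_fastcall"

# Command line copied from reproducer.sh (assumes its default file names).
GDB_CMD = ["gdb", "--batch", "-x", "cmds", "-args", "./python", "-S", "script.py"]


def hit_breakpoint(output):
    """Return True if gdb reported stopping at the breakpoint."""
    return MARKER in output


def run_gdb_once():
    """Run one gdb session and return its combined stdout/stderr as text."""
    proc = subprocess.run(
        GDB_CMD,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.stdout


def loop_until_failure(run_once, max_iterations):
    """Call run_once() until its output lacks the marker.

    Returns (iterations_run, failing_output). failing_output is None when
    every iteration stopped at the breakpoint.
    """
    for i in range(1, max_iterations + 1):
        output = run_once()
        if not hit_breakpoint(output):
            return i, output
    return max_iterations, None


def run_parallel(run_once, workers, max_iterations):
    """Run `workers` independent loops concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(loop_until_failure, run_once, max_iterations)
            for _ in range(workers)
        ]
        return [f.result() for f in futures]
```

Under those assumptions, `run_parallel(run_gdb_once, workers=6, max_iterations=1000)` would approximate running the script in six terminals. Unlike the shell script, this sketch stops after a fixed number of iterations rather than looping forever, which is a deliberate simplification.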
Using reproducer.sh from Comment 2, I've been able to reproduce this problem on both aarch64 and x86_64. In both cases, I'm using an up-to-date Fedora 33 (Up-to-date on the date of this comment.) For aarch64, I'm running in a QEMU-emulated VM that's been configured with 8 cores. For x86_64, I set up a 4 core VM.

I'm seeing five different behaviors:

1) reproducer.sh loops repeatedly, which is the correct behavior.

2) BUG is printed along with the following output:

   Function "meth_fastcall" not defined.
   Breakpoint 1 (meth_fastcall) pending.
   [Thread debugging using libthread_db enabled]
   Using host libthread_db library "/lib64/libthread_db.so.1".

3) BUG is printed along with the following output:

   Function "meth_fastcall" not defined.
   Breakpoint 1 (meth_fastcall) pending.

4) GDB (presumably) hangs after printing output shown in #2, above.

5) GDB (again, presumably) hangs after printing backtrace.

The most common behaviors that I see are #4 (the hang) and #2. Behaviors #3 and #5 are much more rare.

I've observed at least some of these behaviors using gdb-9.2-7 and also using a GDB binary built from current upstream sources without any Fedora patches applied.

ping. Any update here?

(In reply to Charalampos Stratakis from comment #6)
> ping. Any update here?

I'm rebasing Fedora 33 GDB to upstream 10.1 branch. I will retest once this release is made.

I think that this bug might be fixed in gdb-10.1-2.

Using three different VMs, two x86_64 (one 32 core, one 4 core) VMs, and one aarch64 VM (8 core, emulated w/ QEMU), I have not yet been able to reproduce any of the behaviors that I've outlined in Comment 5.

On the x86_64 32 core VM, I'm running the reproducer script concurrently in 32 gnome-terminal tabs. The load average on that VM is over 60. On the x86_64 4 core VM, I'm running the reproducer script concurrently in 8 gnome-terminal tabs w/ a load average of 10 or higher.
On the aarch64 (emulated) 8 core VM, I'm running the reproducer script concurrently in 9 gnome-terminal tabs. It shows a load average over 12. The load average on the virtualization host upon which all of these machines are running (in addition to several other VMs) is over 40.

I used to be able to reproduce at least some of the behaviors noted in Comment 5 within a few minutes. I've been running the reproducer script for several hours now without problem.

I'd like to close this bug, but will leave it open for several more days in order to give Victor a chance to retest with GDB 10.1. Also, I'm going to leave the reproducer scripts running for several more hours and will post an update if I see the original bug or any of the other problems identified in comment 5.

I confirm that gdb-10.1-1.fc34.aarch64 fixes the issue. Thanks!

--

I tested gdb-10.1-1.fc34.aarch64 on python-builder-fedora-rawhide-aarch64 (machine with 8 CPUs): the machine where I initially reproduced the race condition.

I ran reproducer.sh of Comment 2 for 15 min in 6 terminals in parallel (in tmux): I failed to reproduce the issue. The system load was around 8.

I also modified the Python code base to no longer skip test_gdb on gdb 9.2+:

diff --git a/Lib/test/test_gdb.py b/Lib/test/test_gdb.py
index 44cb9a0f07..22c75bae98 100644
--- a/Lib/test/test_gdb.py
+++ b/Lib/test/test_gdb.py
@@ -51,11 +51,6 @@ def get_gdb_version():
 "embedding. Saw %s.%s:\n%s" % (gdb_major_version, gdb_minor_version, gdb_version))
-if (gdb_major_version, gdb_minor_version) >= (9, 2):
-    # gdb 9.2 on Fedora Rawhide is not reliable, see:
-    # * https://bugs.python.org/issue41473
-    # * https://bugzilla.redhat.com/show_bug.cgi?id=1866884
-    raise unittest.SkipTest("https://bugzilla.redhat.com/show_bug.cgi?id=1866884")
 if not sysconfig.is_python_build():
     raise unittest.SkipTest("test_gdb only works on source builds at the moment.")

I also ran "./python -m test -v test_gdb -m test_pycfunction -F" for 15 min in 6 terminals in parallel (in tmux). The system load was around 8.

I wrote https://github.com/python/cpython/pull/23637 to reenable Python test_gdb on gdb 9.2 and newer.

Thanks Kevin for tracking this issue! It's good to see it resolved (and to be able to reenable test_gdb on Python)!
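For anyone revisiting this race on another gdb build, the five behaviors listed in Comment 5 could be told apart automatically with something like the sketch below. It is a rough illustration only, not part of any fix: `classify` and `run_with_timeout` are hypothetical names, the timeout is a stand-in for "GDB (presumably) hangs", and treating "breakpoint marker present but gdb never exited" as behavior #5 is an assumption, since the backtrace is only printed after the breakpoint stop.

```python
import subprocess

# Fixed strings taken from the gdb output quoted in this bug.
MARKER = "Breakpoint 1, meth_fastcall"
PENDING = "Breakpoint 1 (meth_fastcall) pending."
LIBTHREAD = "[Thread debugging using libthread_db enabled]"


def classify(output, timed_out):
    """Map one gdb run onto the behaviors described in Comment 5."""
    if timed_out:
        # Assumption: a stop at the breakpoint followed by no exit is
        # the "hang after printing backtrace" case.
        if MARKER in output:
            return "5: hang after backtrace"
        if LIBTHREAD in output:
            return "4: hang after libthread_db output"
        return "unclassified hang"
    if MARKER in output:
        return "1: stopped at breakpoint (correct)"
    if LIBTHREAD in output:
        return "2: exited, breakpoint still pending"
    if PENDING in output:
        return "3: exited before libthread_db output"
    return "unclassified"


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (output_text, timed_out).

    On timeout, subprocess.run() kills the child and raises
    TimeoutExpired; any output captured so far is on the exception.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
        return proc.stdout.decode(errors="replace"), False
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or b""
        return partial.decode(errors="replace"), True
```

A call such as `classify(*run_with_timeout(cmd, 60))`, with `cmd` set to the gdb command line from reproducer.sh, would then yield one label per run. The 60 second limit is arbitrary and would need tuning on a heavily loaded machine, where a slow but healthy run could otherwise be misreported as a hang.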