Bug 1866884

Summary:	AArch64: sometimes, gdb fails to load symbols of a dynamic library with a pending breakpoint
Product:	[Fedora] Fedora	Reporter:	Victor Stinner <vstinner>
Component:	gdb	Assignee:	Kevin Buettner <kevinb>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	33	CC:	cstratak, jan.kratochvil, keiths, kevinb, pmuldoon, sergiodj
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-12-04 16:46:19 UTC	Type:	Bug

Description Victor Stinner 2020-08-06 17:23:48 UTC

Description of problem:

The test_gdb test of Python fails randomly on Fedora Rawhide AArch64. I tested the master branch of Python.

Version-Release number of selected component (if applicable):

* gcc-10.2.1-1.fc33.aarch64 ("gcc (GCC) 10.2.1 20200723 (Red Hat 10.2.1-1)")
* gdb-9.2-2.fc33.aarch64 ("GNU gdb (GDB) Fedora 9.2-2.fc33")
* master branch of Python


How reproducible:

It is not easy to reproduce. It looks like a race condition which depends on the system load. I reproduced the issue on a server where I don't control the system load.


Steps to Reproduce using python-gdb.py and Python test suite:

1. git clone https://github.com/python/cpython
2. cd cpython
3. ./configure --with-pydebug  # Python uses GCC by default
4. make
5. ./python -m test -v test_gdb  -m test_pycfunction  -F


I'm trying to write a smaller reproducer script, but since the bug only triggers randomly, it's hard to simplify the test case.


The issue was first reported to Python: https://bugs.python.org/issue41473#msg374825.

Comment 1 Victor Stinner 2020-08-06 17:33:09 UTC

Here is a variant which only uses gdb to reproduce the bug (it doesn't use python-gdb.py nor the Python test suite).

It seems like there are different cases:

* Sometimes, everything works
* Sometimes, Python is stopped at the breakpoint but "bt" fails to display symbols.
* Sometimes, Python does not stop at the breakpoint.


"break meth_fastcall": this function is defined in _testcapi.cpython-310d-aarch64-linux-gnu.so, a dynamic library which is loaded after Python is started, by the Python "import _testcapi" command.


Steps to Reproduce:

1. git clone https://github.com/python/cpython
2. cd cpython
3. ./configure --with-pydebug  # Python uses GCC -Og by default
4. make
5. rm -f python-gdb.py 
6. ./script2.sh

These steps use the following three files.

Note on (6): You can try to run ./script2.sh multiple times in parallel to make the bug more likely.


script2.sh (executable shell script):
---
#!/bin/sh
# Loop until gdb fails to stop at the meth_fastcall breakpoint.
OUT=$(mktemp)
while true; do
  # Run one gdb session and keep a copy of its output for inspection.
  gdb --batch -x cmds2 -args ./python -S x.py < /dev/null 2>&1 | tee "$OUT"
  # A good run prints this line; leave the loop as soon as it is missing.
  grep -q -F 'Breakpoint 1, meth_fastcall' "$OUT" || break

  echo
  echo "===================================================="
  echo
done

echo "#######################################################"
echo "BUG!"
echo "#######################################################"
cat "$OUT"
echo "#######################################################"

rm -f "$OUT"
---


x.py:
---
import _testcapi; _testcapi.meth_fastcall()
---


cmds2:
---
set breakpoint pending on
set print address off
break meth_fastcall
run
set print entry-values no
bt
---
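
The note on step 6 suggests running several copies of the script at once. A minimal Python sketch of such a launcher follows; the helper name `run_in_parallel` and the count of 6 in the example are illustrative assumptions, not part of the original report.

```python
import subprocess


def run_in_parallel(cmd, copies):
    """Start `copies` instances of `cmd` at once and return their exit codes."""
    # Launch every instance first so that they compete for the CPUs,
    # then wait for each one in turn.
    procs = [subprocess.Popen(cmd) for _ in range(copies)]
    return [proc.wait() for proc in procs]


# Example (hypothetical count): run_in_parallel(["./script2.sh"], 6)
```

Each copy of script2.sh loops until it sees a failure, so the call returns only once every copy has hit the bug. The copies share the terminal, so their output is interleaved.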

Comment 2 Victor Stinner 2020-08-07 12:54:05 UTC

I failed to reproduce the issue on Fedora 32 AArch64 with gdb-9.1-5.fc32.aarch64. It seems to be a regression introduced between gdb-9.1-5.fc32.aarch64 and gdb-9.2-2.fc33.aarch64.

I tried the methods from Comment 0 and Comment 1:

* Comment 0 method: I ran "./python -m test -v test_gdb  -m test_pycfunction  -F -j8" for 8 minutes, no failure. It runs the test 8x in parallel.
* Comment 1 method: I ran the script 6x in parallel for 5 min, while I stressed the machine to increase the system load (around 23.48 on a machine with 8 CPUs), no failure either.

By the way, copy/paste is boring, so I wrote a single script for the Comment 1 method.

reproducer.sh:
---
#!/bin/sh
# Self-contained variant: creates its input files, then loops until gdb
# fails to stop at the meth_fastcall breakpoint.
SCRIPT=script.py
CMDS=cmds
OUT=$(mktemp)

# Create the Python script and the gdb command file if they are missing.
if [ ! -e "$SCRIPT" ]; then
    echo 'import _testcapi; _testcapi.meth_fastcall()' >"$SCRIPT"
fi
if [ ! -e "$CMDS" ]; then
    cat >"$CMDS" <<EOF
set breakpoint pending on
set print address off
break meth_fastcall
run
set print entry-values no
bt
EOF
fi

# Make sure gdb does not auto-load the Python helper script.
rm -f python-gdb.py

while true; do
  gdb --batch -x "$CMDS" -args ./python -S "$SCRIPT" < /dev/null 2>&1 | tee "$OUT"
  # A good run prints this line; leave the loop as soon as it is missing.
  grep -q -F 'Breakpoint 1, meth_fastcall' "$OUT" || break

  echo
  echo "===================================================="
  echo
done

echo "#######################################################"
echo "BUG!"
echo "#######################################################"
cat "$OUT"
echo "#######################################################"

rm -f "$OUT"
---

Comment 3 Keith Seitz 2020-08-10 17:13:55 UTC

A data point... I grabbed a beaker machine, ran *ten* parallel looping builds of gdb on it (pushing the load average to a peak close to 50), and then ran five instances of the reproducer script you provided (THANK YOU SO MUCH!).

Unfortunately, after 60,000+ iterations of the script over the weekend, never once did the bug appear.

Since the differences between 9.1 and 9.2 are so minor, I can't help but wonder if this isn't caused or exacerbated by something in the compose, such as glibc or the kernel. I also wonder if this was fixed in some recent/pending update. I tested on the Fedora-Rawhide-20200805.n.0 Server compose.

The next steps (beyond updating the server, which is, as I read it, not possible) are to try a 9.2 build on your system or even FSF master. Let me know if you would like me to provide you such a build.

Comment 4 Ben Cotton 2020-08-11 15:18:00 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle.
Changing version to 33.

Comment 5 Kevin Buettner 2020-09-03 20:09:58 UTC

Using reproducer.sh from Comment 2, I've been able to reproduce this problem on both aarch64 and x86_64. In both cases, I'm using a Fedora 33 that is up to date as of this comment. For aarch64, I'm running in a QEMU-emulated VM that's been configured with 8 cores. For x86_64, I set up a 4 core VM.

I'm seeing five different behaviors:

1) reproducer.sh loops repeatedly, which is the correct behavior.

2) BUG is printed along with the following output:

Function "meth_fastcall" not defined.
Breakpoint 1 (meth_fastcall) pending.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

3) BUG is printed along with the following output:

Function "meth_fastcall" not defined.
Breakpoint 1 (meth_fastcall) pending.

4) GDB (presumably) hangs after printing output shown in #2, above.

5) GDB (again, presumably) hangs after printing backtrace.

The most common behaviors that I see are #4 (the hang) and #2.  Behaviors #3 and #5 are much more rare.

I've observed at least some of these behaviors using gdb-9.2-7 and also using a GDB binary built from current upstream sources without any Fedora patches applied.
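
To tally how often each behavior occurs, one run can be mapped to the numbering above with a small classifier. This is only a sketch: the function name `classify` is hypothetical, the `timed_out` flag is assumed to come from the caller (for example from a timeout around the gdb invocation), and treating the "Breakpoint 1, meth_fastcall" line as a sign that the backtrace was reached is an assumption.

```python
def classify(output, timed_out):
    """Map one gdb run to behavior 1-5 from the list above, or None."""
    # The reproducer treats this line as the sign of a successful stop.
    hit = "Breakpoint 1, meth_fastcall" in output
    # Behaviors 2 and 4 print the libthread_db lines; behavior 3 does not.
    thread_db = "Using host libthread_db library" in output

    if hit:
        return 5 if timed_out else 1
    if thread_db:
        return 4 if timed_out else 2
    # A hang before the libthread_db lines is not one of the listed cases.
    return None if timed_out else 3
```

For instance, output that ends with the two libthread_db lines and a gdb process that had to be killed after the timeout would be counted as behavior 4.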

Comment 6 Charalampos Stratakis 2020-10-14 14:25:24 UTC

ping. Any update here?

Comment 7 Kevin Buettner 2020-10-14 16:18:42 UTC

(In reply to Charalampos Stratakis from comment #6)
> ping. Any update here?

I'm rebasing Fedora 33 GDB to upstream 10.1 branch. I will retest once this release is made.

Comment 8 Kevin Buettner 2020-12-03 20:54:07 UTC

I think that this bug might be fixed in gdb-10.1-2.

Using three different VMs, two x86_64 (one 32 core, one 4 core) VMs, and one aarch64 VM (8 core, emulated w/ QEMU), I have not yet been able to reproduce any of the behaviors that I've outlined in Comment 5.

On the x86_64 32 core VM, I'm running the reproducer script concurrently in 32 gnome-terminal tabs. The load average on that VM is over 60.

On the x86_64 4 core VM, I'm running the reproducer script concurrently in 8 gnome-terminal tabs w/ a load average of 10 or higher.

On the aarch64 (emulated) 8 core VM, I'm running the reproducer script concurrently in 9 gnome-terminal tabs.  It shows a load average over 12.

The load average on the virtualization host upon which all of these machines are running (in addition to several other VMs) is over 40.

I used to be able to reproduce at least some of the behaviors noted in Comment 5 within a few minutes.  I've been running the reproducer script for several hours now without problem.

I'd like to close this bug, but will leave it open for several more days in order to give Victor a chance to retest with GDB 10.1.  Also, I'm going to leave the reproducer scripts running for several more hours and will post an update if I see the original bug or any of the other problems identified in comment 5.

Comment 9 Victor Stinner 2020-12-04 10:40:21 UTC

I confirm that gdb-10.1-1.fc34.aarch64 fixes the issue. Thanks!

--

I tested gdb-10.1-1.fc34.aarch64 on python-builder-fedora-rawhide-aarch64 (machine with 8 CPUs): the machine where I initially reproduced the race condition.

I ran reproducer.sh of Comment 2 for 15 min in 6 terminals in parallel (in tmux): I failed to reproduce the issue. The system load was around 8.

I also modified the Python code base to no longer skip test_gdb on gdb 9.2+:

diff --git a/Lib/test/test_gdb.py b/Lib/test/test_gdb.py
index 44cb9a0f07..22c75bae98 100644
--- a/Lib/test/test_gdb.py
+++ b/Lib/test/test_gdb.py
@@ -51,11 +51,6 @@ def get_gdb_version():
                             "embedding. Saw %s.%s:\n%s"
                             % (gdb_major_version, gdb_minor_version,
                                gdb_version))
-if (gdb_major_version, gdb_minor_version) >= (9, 2):
-    # gdb 9.2 on Fedora Rawhide is not reliable, see:
-    # * https://bugs.python.org/issue41473
-    # * https://bugzilla.redhat.com/show_bug.cgi?id=1866884
-    raise unittest.SkipTest("https://bugzilla.redhat.com/show_bug.cgi?id=1866884")
 
 if not sysconfig.is_python_build():
     raise unittest.SkipTest("test_gdb only works on source builds at the moment.")


I also ran "./python -m test -v test_gdb -m test_pycfunction -F" for 15 min in 6 terminals in parallel (in tmux). The system load was around 8.
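
The check removed in the diff above relies on Python comparing tuples element by element, which is why it skipped the test on every release from 9.2 onward, including 10.1. The sketch below illustrates that. `parse_gdb_version` and its regular expression are simplified stand-ins rather than the code in test_gdb.py, and the 9.1 and 10.1 banner strings are assumed by analogy with the 9.2 banner quoted in the description.

```python
import re


def parse_gdb_version(banner):
    """Return (major, minor) from a gdb version banner (simplified)."""
    match = re.search(r"(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no version number in %r" % banner)
    return int(match.group(1)), int(match.group(2))


def was_skipped(banner):
    """Mirror the removed condition: skip on gdb 9.2 and newer."""
    # Tuples compare lexicographically: major first, then minor.
    return parse_gdb_version(banner) >= (9, 2)


for banner in ("GNU gdb (GDB) Fedora 9.1-5.fc32",
               "GNU gdb (GDB) Fedora 9.2-2.fc33",
               "GNU gdb (GDB) Fedora 10.1-2.fc33"):
    print(banner, "->", "skipped" if was_skipped(banner) else "run")
```

Comparing the tuple (10, 1) against (9, 2) stops at the first element, so a fixed 10.x release would still have been skipped until the check was removed.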

Comment 10 Victor Stinner 2020-12-04 10:48:16 UTC

I wrote https://github.com/python/cpython/pull/23637 to reenable Python test_gdb on gdb 9.2 and newer.

Comment 11 Victor Stinner 2020-12-04 17:18:53 UTC

Thanks Kevin for tracking this issue! It's good to see it resolved (and to be able to re-enable test_gdb in Python)!