Bug 2203919

Summary: python3.9 SIGSEGV during installation and in runtime in RHEL 9.2
Product: Red Hat Enterprise Linux 9
Component: python3.9
Version: 9.2
Hardware: ppc64le
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Reporter: Neil Hanlon <neil>
Assignee: Python Maintainers <python-maint>
QA Contact: RHEL CS Apps Subsystem QE <rhel-cs-apps-subsystem-qe>
CC: dhorak, farrotin, janani, jcajka, neil, redhat-bugzilla, rik.theys, skip, torsava
Target Milestone: rc
Target Release: ---
Last Closed: 2023-05-23 09:47:25 UTC
Type: Bug
Attachments:
- collection of logs (/var/log, dmesg)
- core.python3-1
- core.python3-2
- core.python3-3
- Python 3.9 Coredumps
- Anaconda segfault dmesg output
- Anaconda pre-load black+green screen
- Python 3.9 failing build.log when building on POWER9 rev. 2.0 processor

Description Neil Hanlon 2023-05-15 15:11:39 UTC
Created attachment 1964672 [details]
collection of logs (/var/log, dmesg)

Description of problem:

In RHEL 9.2, when booting from the boot ISO on ppc64le, segfaults are reported by at least one program, and sometimes more than one.

```
[  165.485935] rhsm-service[1628]: segfault (11) at 8 nip 7fffa00daa90 lr 7fffa00dae34 code 1 in libpython3.9.so.1.0[7fffa0000000+380000]
[  165.488917] rhsm-service[1628]: code: 7c84e050 2c090000 7ddc502a 7ebc5214 40c275fc 60000000 3900ffff e94280f0 
[  165.489002] rhsm-service[1628]: code: 7905f80e e9ea0000 2c2f0000 41c271cc <e90e0008> e8e800a8 70e70800 41c24ab4 
[  223.120385] anaconda[1767]: segfault (11) at 8 nip 7fff8d0daa90 lr 7fff8d0dae34 code 1 in libpython3.9.so.1.0[7fff8d000000+380000]
[  223.122915] anaconda[1767]: code: 7c84e050 2c090000 7ddc502a 7ebc5214 40c275fc 60000000 3900ffff e94280f0 
[  223.122969] anaconda[1767]: code: 7905f80e e9ea0000 2c2f0000 41c271cc <e90e0008> e8e800a8 70e70800 41c24ab4 
[  364.434306] rhsm-service[1943]: segfault (11) at 8 nip 7fff962daa90 lr 7fff962dae34 code 1 in libpython3.9.so.1.0[7fff96200000+380000]
[  364.438863] rhsm-service[1943]: code: 7c84e050 2c090000 7ddc502a 7ebc5214 40c275fc 60000000 3900ffff e94280f0 
[  364.438931] rhsm-service[1943]: code: 7905f80e e9ea0000 2c2f0000 41c271cc <e90e0008> e8e800a8 70e70800 41c24ab4 
```

These segfaults did not produce coredumps due to limits.conf settings in the installation environment. It is possible to reproduce the crash in the installation environment after enabling core dumps (via `ulimit -c unlimited`, for example).
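A minimal sketch of that workflow in an installation-environment shell (the `anaconda` re-run is the program from this report; everything else is standard shell):

```shell
# Raise the per-process core-size limit in the current shell, so the
# next SIGSEGV leaves a core file instead of being silently dropped.
ulimit -c unlimited
ulimit -c            # confirm the new limit; prints "unlimited"
# anaconda           # then re-run the crashing program (commented out here)
```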

Three example core dumps are attached, which were generated by running 'anaconda' from the installation environment after anaconda had initially segfaulted.

During my investigation, I found that the segfault occurs seemingly at random and appears likely to be caused by stack corruption in libpython.

Version-Release number of selected component (if applicable): python3.9-3.9.16-1.el9


How reproducible: almost always. I have reproduced it in multiple QEMU environments (on hosts running RHEL 8, RHEL 9, and Fedora 38) as well as on physical hardware (Talos). All tests were on POWER9 hardware.


Steps to Reproduce:
1. Boot RHEL 9.2 PPC64LE Boot.iso
2. Wait for system to complete dracut and switch root
3. Wait for anaconda to start
4. anaconda should crash, leading to 'Pane is dead' message in tmux. Other segfaults may be witnessed prior to this point

Actual results:

SIGSEGV encountered by libpython3 while importing modules

Expected results:

Anaconda executes normally and system is able to be installed

Additional info:

A tarball containing relevant logs from the booted RHEL 9.2 system is attached.

Some coredumps captured on the RHEL 9.2 boot iso are attached.

Comment 1 Neil Hanlon 2023-05-15 15:15:01 UTC
Created attachment 1964674 [details]
core.python3-1

Comment 2 Neil Hanlon 2023-05-15 15:15:37 UTC
Created attachment 1964675 [details]
core.python3-2

Comment 3 Neil Hanlon 2023-05-15 15:15:47 UTC
Created attachment 1964676 [details]
core.python3-3

Comment 4 Skip Grube 2023-05-15 18:49:52 UTC
The Rocky team has been doing some testing around this issue, and we believe it affects RHEL 9.2, current CentOS Stream 9, and downstream rebuilds such as our own (Rocky Linux).

Greetings from downstream!

Adding some data points to Neil's message: I have taken a functional Rocky Linux 9.1 ppc64le installation (qemu / pseries 7.0 / POWER9) and upgraded it to Rocky 9.2. The behavior is consistent with what he sees on the RHEL 9.2 boot ISO: any program that accesses libpython3.9.so.1.0 in some way tends to break, and the breaks are random. Sometimes DNF crashes, sometimes it completes a query. Sometimes firewalld crashes, and sometimes it comes up as expected. Rinse and repeat for just about all non-trivial Python 3.9 programs.

Our team has experienced this behavior both on physical hardware (Talos was mentioned above) and in QEMU running on top of x86_64 hardware. The behavior seemed to be the same, which led us to rule out some kind of QEMU implementation issue.

Here is a snippet from dmesg to demonstrate what I mean:

```
[Mon May 15 14:32:50 2023] firewalld[734]: segfault (11) at 8 nip 7fff9cedab94 lr 7fff9cedaf30 code 1 in libpython3.9.so.1.0[7fff9ce00000+380000]
[Mon May 15 14:32:50 2023] firewalld[734]: code: 2c090000 7ddc502a 7ebc5214 41e20008 480088a4 60000000 3900ffff e94280f0 
[Mon May 15 14:32:50 2023] firewalld[734]: code: 7905f80e e9ea0000 2c2f0000 41c26ef0 <e90e0008> e8e800a8 70e70800 41c24ab0 
[Mon May 15 14:33:41 2023] tuned[763]: segfault (11) at a8 nip 7fffac6dab98 lr 7fffac6daf30 code 1 in libpython3.9.so.1.0[7fffac600000+380000]
[Mon May 15 14:33:41 2023] tuned[763]: code: 7ddc502a 7ebc5214 41e20008 480088a4 60000000 3900ffff e94280f0 7905f80e 
[Mon May 15 14:33:41 2023] tuned[763]: code: e9ea0000 2c2f0000 41c26ef0 e90e0008 <e8e800a8> 70e70800 41c24ab0 e9080038 
[Mon May 15 14:34:31 2023] dnf[1157]: segfault (11) at 8 nip 2000366dab94 lr 2000366daf30 code 1 in libpython3.9.so.1.0[200036600000+380000]
[Mon May 15 14:34:31 2023] dnf[1157]: code: 2c090000 7ddc502a 7ebc5214 41e20008 480088a4 60000000 3900ffff e94280f0 
[Mon May 15 14:34:31 2023] dnf[1157]: code: 7905f80e e9ea0000 2c2f0000 41c26ef0 <e90e0008> e8e800a8 70e70800 41c24ab0 
```

I've found calling "sos report" can consistently produce the segfault.  Other programs (firewalld, dnf) may complete or run successfully sometimes, but not all of the time.

I'm attaching coredumps from dnf and sosreport here; I hope they can be useful in some way.

Thank you!

-Skip Grube

Comment 5 Skip Grube 2023-05-15 18:52:12 UTC
Created attachment 1964707 [details]
Python 3.9 Coredumps

These are coredumps and coredumpctl info from a crashing DNF and sosreport on my ppc64le system.

Comment 6 Skip Grube 2023-05-16 03:36:36 UTC
One more data point that could be relevant:

While doing test builds of python3.9-3.9.16-1.el9 on ppc64le to troubleshoot this issue, I noticed that one of the subprocess tests in %check fails intermittently. I'm not sure whether it is related to the segfaults, but it seemed suspicious. I'm running mock with isolation=simple, if that makes any difference. Here is the failing test output from build.log:

```
test_close_fds (test.test_subprocess.POSIXProcessTestCase) ... Timeout (0:30:00)!
Thread 0x00007fffb4d53ba0 (most recent call first):
  File "/builddir/build/BUILD/Python-3.9.16/Lib/subprocess.py", line 1121 in communicate
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/test_subprocess.py", line 2683 in test_close_fds
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/case.py", line 550 in _callTestMethod
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/case.py", line 592 in run
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/case.py", line 651 in __call__
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/suite.py", line 122 in run
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/suite.py", line 84 in __call__
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/suite.py", line 122 in run
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/suite.py", line 84 in __call__
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/suite.py", line 122 in run
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/suite.py", line 84 in __call__
  File "/builddir/build/BUILD/Python-3.9.16/Lib/unittest/runner.py", line 184 in run
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/support/__init__.py", line 1850 in _run_suite
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/support/__init__.py", line 1974 in run_unittest
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/libregrtest/runtest.py", line 263 in _test_module
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/libregrtest/runtest.py", line 288 in _runtest_inner2
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/libregrtest/runtest.py", line 326 in _runtest_inner
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/libregrtest/runtest.py", line 217 in _runtest
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/libregrtest/runtest.py", line 247 in runtest
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/libregrtest/main.py", line 334 in rerun_failed_tests
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/libregrtest/main.py", line 716 in _main
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/libregrtest/main.py", line 672 in main
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/libregrtest/main.py", line 733 in main
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/regrtest.py", line 43 in _main
  File "/builddir/build/BUILD/Python-3.9.16/Lib/test/regrtest.py", line 47 in <module>
  File "/builddir/build/BUILD/Python-3.9.16/Lib/runpy.py", line 87 in _run_code
  File "/builddir/build/BUILD/Python-3.9.16/Lib/runpy.py", line 197 in _run_module_as_main
```

Comment 7 janani 2023-05-16 20:12:40 UTC
Have you tried a fresh install in a container, just to rule out any cross-contamination?

Comment 8 Neil Hanlon 2023-05-16 20:16:53 UTC
We are able to reliably reproduce the issue using the RHEL 9.2 for Power (LE) ISOs, as well as upgrading from the RHEL 9.1 ISOs to 9.2.

Comment 9 Tomas Orsava 2023-05-17 12:50:15 UTC
Hi,
could you please provide more info on the ppc64le machine?
Ideally, step-by-step instructions to set up a QEMU virtual machine on a Fedora x86_64 host.

During our testing we use a PXE boot, and we haven't run into the issue.

Comment 10 Skip Grube 2023-05-17 17:58:09 UTC
Sure, I just installed Fedora 38 on an Intel-based laptop. Here are my notes:

- Install Fedora 38 (latest workstation) from public ISO

- dnf -y update  && reboot

- dnf install qemu-system-ppc virt-manager

- usermod -G qemu -a skip;  usermod -G kvm -a skip  (then log out/log back in)

- Launch virt-manager:
  - New machine
  - Local media install, Architecture: ppc64le, Machine Type: pseries
  - Add iso location, choose the rhel-9.2 ppc64le dvd ISO file as the install media
  - Choose cpu/memory sizing: I used 2 cpu cores, 4096 MB for testing
  - New disk, 30 GB in size
  - Default network settings (NAT), customize configuration before install
  - Change HD controller from virtio to SCSI (a more accurate match to actual P-series hardware)
  - Ensure display device is set to "VGA" (should be the default)
  - Launch VM with "Begin Installation"
  
- The ISO will boot, and Anaconda will attempt to load
- Things won't proceed beyond the black screen with the green bar. Anaconda will cause a segfault in libpython3.9.so.1.0. Other Python-related programs may also trigger it.
- I'm attaching 2 screenshots from my session performing this - both the black screen and dmesg output showing the segfaults
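For reference, the virt-manager steps above correspond roughly to the following direct QEMU invocation. This is a sketch under the settings listed (the disk and ISO file names are placeholders, not the exact command virt-manager generates):

```shell
# Rough command-line equivalent of the virt-manager setup above.
# File names are placeholders; point them at your ISO and disk image.
qemu-img create -f qcow2 rhel92-ppc64le.qcow2 30G

qemu-system-ppc64 \
  -machine pseries \
  -smp 2 -m 4096 \
  -device virtio-scsi-pci,id=scsi0 \
  -drive file=rhel92-ppc64le.qcow2,if=none,format=qcow2,id=hd0 \
  -device scsi-hd,bus=scsi0.0,drive=hd0 \
  -cdrom rhel-9.2-ppc64le-dvd.iso \
  -vga std
```

TCG emulation of ppc64le on x86_64 is slow, but per this report the segfaults reproduce in emulation as well as on physical POWER9 hardware.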


It's worth noting that this is not just an Anaconda issue; it appears to affect anything written in Python. We've also triggered it by taking a 9.1 ppc64le system, updating it to 9.2, and running "sos report".

sosreport calls many things written in Python, and is itself Python-based.  It tends to trigger segfaults in all manner of Python-linked programs:  dnf, firewalld, tuned, etc.

Thanks, hope this helps

-Skip Grube

Comment 11 Skip Grube 2023-05-17 17:59:02 UTC
Created attachment 1965186 [details]
Anaconda segfault dmesg output

Comment 12 Skip Grube 2023-05-17 17:59:47 UTC
Created attachment 1965187 [details]
Anaconda pre-load black+green screen

Comment 13 Dan Horák 2023-05-19 08:24:35 UTC
Please let's focus on reproducing on real hw, which is also said to be affected, to avoid any possible cross-arch emulation issues.

Comment 15 Dan Horák 2023-05-19 12:32:22 UTC
10 out of 10 installs of RHEL 9.2 GA went OK for me in a KVM guest on a P9 host (Boston), which makes me believe there is a specific bit in the reporter's setup that exposes the misbehaviour. Also, I don't think it was noticed during the internal testing of RHEL 9.2.

Comment 16 Skip Grube 2023-05-19 15:59:06 UTC
I believe I've found the root cause of what I'm experiencing.

After testing on different hosts, both emulated and physical, we noticed a pattern: in /proc/cpuinfo, the POWER9 processors report different revisions depending on the platform:
```
revision    : 2.3 (pvr 004e 1203)
revision    : 2.2 (pvr 004e 1202)
revision    : 2.0 (pvr 004e 1200)
```

I've been able to reproduce this issue on hosts with the 2.0/1200 revision, but not 2.2 or 2.3. Our test on a Raptor Talos II workstation (physical) with the 2.0 revision manifested the issue, but on an IBM 9006-12P (and its KVM guests) the CPU revision is 2.3/1203 and the problem is absent. All of the QEMU-emulated environments I test on locally use the 2.0 revision.
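The revision check above can be scripted. A minimal sketch, using a sample cpuinfo line and the DD2.0 PVR value (004e 1200) from this comment; on a live system you would read the line from /proc/cpuinfo instead:

```shell
# Extract the PVR from a /proc/cpuinfo "revision" line and flag DD2.0 parts.
# On a live system: line=$(grep -m1 '^revision' /proc/cpuinfo)
line='revision    : 2.0 (pvr 004e 1200)'
pvr=$(printf '%s\n' "$line" | sed -n 's/.*(pvr \([0-9a-f ]*\)).*/\1/p')
echo "pvr=$pvr"               # -> pvr=004e 1200
case "$pvr" in
  '004e 1200') echo 'POWER9 DD2.0 (affected)' ;;
  *)           echo 'not DD2.0' ;;
esac
```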
 

Curious to dig further, I attempted to build python3.9 from RHEL 9.2 on a machine with the older 2.0 revision (mock, 9.2 buildroot). GCC fails with an internal compiler error in tree-ssa-coalesce.c when attempting this. I'm attaching the build.log in case there's any interest. I can confirm that RHEL 9.1 works properly on the rev 2.0 POWER9s, as far as I can tell.


This may end up being very minor: I'm not sure how much IBM hardware with this older processor revision is out there, if any. It should probably be noted somewhere that 9.2 will likely be unstable on this revision of POWER9. I believe we noticed the Python failures because Python gets called a lot and is easy to trigger, but I imagine other packages are affected as well.


I apologize if I sounded alarmist earlier.  Too many of our test suites run on the older processors, and we falsely assumed it affected all of ppc64le.  This might be a non-issue depending on the relative popularity of that 2.0 revision.  I'm not at all familiar with how PowerPC internals or processor revisions work, but I'm sure there are engineers at Red Hat who are!


Thanks, and I hope this helps!  I'm attaching my failing Python build.log as a reference in case there's interest in replicating it.


-Skip Grube

Comment 17 Dan Horák 2023-05-19 16:13:11 UTC
I think it makes sense now, the first production release (revision) of P9 is 2.2, with 2.1 having some bugs in virt support (see https://wiki.raptorcs.com/w/images/5/50/POWER9_dd21_use_restrictions_v10_05JUN2019_pub.pdf for details) and 2.0 having even more bugs (and shouldn't be used at all for any production work).

Comment 18 Skip Grube 2023-05-19 16:13:23 UTC
Created attachment 1965743 [details]
Python 3.9 failing build.log when building on POWER9 rev. 2.0 processor

Comment 21 Miro Hrončok 2023-05-23 09:47:25 UTC
Apologies, but apparently running on pre-production hw/cpu isn't supported by Red Hat, so we won't consider this a bug.

I'd be happy to review a merge request to Python, but as said above this probably isn't limited to Python at all.

Comment 22 Neil Hanlon 2023-05-23 13:27:47 UTC
Hi Miro and Dan,

Thank you for your investigation into this issue. We have done some additional testing and have a few points/questions to help us identify what is going on.

From our perspective, this is a full regression from a previously working state. While we understand that the hw/cpu combination is not supported by Red Hat, it is clear that there is some regression in a component causing even the latest QEMU from Fedora to fail to boot the RHEL 9.2 DVD. We would like to understand what this is, and why.

Notably, I have booted a ppc64le QEMU VM with the latest SLOF[1] firmware passed to QEMU's `-bios` option, and still cannot get /proc/cpuinfo to display the 'correct' string. As near as I can tell, the latest SLOF firmware should not be subject to these pre-production issues, yet the setup nonetheless produces repeatable segfaults with the current RHEL 9 release.

I began looking into the QEMU source code, and as near as I can tell, the power9_v2.0 string is statically coded[2] into the source, not actually presented by the firmware in any meaningful fashion. Put another way: I believe our QEMU tests report that they are "power9_v2.0", but in reality boot the latest SLOF, and so, as I understand it, should not have this issue. Notably, QEMU appears to define models only up to POWER9 DD2.0 [3], and I could not locate any RHEL-specific patches that override or supplement this.

I am more than happy to re-file a bug against another component and/or product, if that is a more appropriate place for the bug. As it stands, we cannot find a reliable way to run the 9.2 ISOs under Power Emulation. We have yet to try to flash a new firmware to the one physical host we have.

Any help you can provide would be much appreciated.


[1] https://github.com/aik/SLOF 
[2] https://github.com/qemu/qemu/blob/aa222a8e4f975284b3f8f131653a4114b3d333b3/hw/ppc/spapr.c#L4634
[3] https://github.com/qemu/qemu/blob/363fd548abd5fbef040ee001c6694672bfb0d798/target/ppc/cpu-models.h#L375

Comment 23 Neil Hanlon 2023-05-23 13:53:31 UTC
follow up:

I'm chatting with #qemu on irc.oftc.net, and they've alerted me to the fact that there are patches for POWER9 DD2.2 (and 2.3) in the pipeline [1]. They also suggested I open a GitLab ticket to track this, which I will do later today, providing a link back here to complete the circle.

Thank you again for your assistance in this matter! We're glad that this appears largely environmental, at its root.

[1] https://lists.nongnu.org/archive/html/qemu-ppc/2023-05/msg00161.html