Bug 1972293

Summary: Python36 crashes with libgcc_s.so.1 must be installed for pthread_cancel to work
Product: Red Hat Enterprise Linux 8 Reporter: Jiri Danek <jdanek>
Component: python3 Assignee: Python Maintainers <python-maint>
Status: CLOSED ERRATA QA Contact: Lukáš Zachar <lzachar>
Severity: unspecified Docs Contact:
Priority: low    
Version: 8.4 CC: cstratak, pematous, pviktori, vstinner
Target Milestone: beta Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python3-3.6.8-39.el8 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-11-09 19:39:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
threads.py none

Description Jiri Danek 2021-06-15 15:34:39 UTC
Created attachment 1791284
threads.py

Description of problem:
***********************

Python crashes with a core dump when I run the attached reproducer program after setting a low limit on open file descriptors.

In practice, this issue was originally encountered as

https://issues.redhat.com/browse/ENTMQCL-1699
https://issues.redhat.com/browse/ENTMQCL-2787

This issue manifests both on RHEL 7 and RHEL 8 with Python version 3.6. It does not manifest with Python version 2.7 (on RHEL 7).

Steps to reproduce:
*******************

# bash
# prlimit --pid $$ --nofile=5:5
# python threads.py

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "loop.py", line 6, in run_in_thread
Exception: aaaaaaaaaa

libgcc_s.so.1 must be installed for pthread_cancel to work
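
The attached threads.py is not reproduced in this report. A minimal sketch of what such a reproducer might look like, assuming it simply starts many threads whose target raises: the name `run_in_thread` and the exception text are taken from the traceback above, while `spawn_failing_threads` and its `count` parameter are hypothetical. The file descriptor limit is assumed to be set from outside with prlimit, as in the steps above.

```python
import threading


def run_in_thread():
    # Name and message mirror the traceback in this report; the body of the
    # real attachment is unknown, so this is only an assumption.
    raise Exception("aaaaaaaaaa")


def spawn_failing_threads(count=1000):
    """Start `count` threads that each die with an uncaught exception.

    Hypothetical helper: returns the number of threads that were joined.
    """
    threads = [threading.Thread(target=run_in_thread) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(threads)
```

Calling `spawn_failing_threads()` prints one traceback per thread. Whether the abort follows depends on the race, so several runs under the low limit may be needed.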

Workaround:
***********

# export LD_PRELOAD=/usr/lib64/libgcc_s.so.1
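
An in-process variant of the same idea might load the library explicitly before any thread is started, so that nothing needs to be opened later when file descriptors are scarce. This is a sketch under that assumption, not something verified in this report. `preload_libgcc` is a hypothetical helper name; `ctypes.CDLL` is the standard-library call for loading a shared object.

```python
import ctypes


def preload_libgcc(name="libgcc_s.so.1"):
    """Try to load libgcc_s into the current process.

    Hypothetical helper: returns the ctypes handle on success, or None if
    the library cannot be found or opened (for example on a non-glibc
    platform, or when no file descriptor is available).
    """
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None
```

To have any effect, this would need to run early in the program, while the process can still open files and before the first thread exits.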

Version-Release number of selected component (if applicable):
*************************************************************

Latest image, started with: docker run --rm -it registry.access.redhat.com/ubi8/ubi
python36 3.6.8-2.module+el8.1.0+3334+5cb623d7 from ubi-8-appstream

How reproducible:
*****************

Intermittently, but happens fairly frequently using the attached reproducer. About 10 attempts at running the reproduction steps should be sufficient to reproduce.

Stacktrace:
***********

There is a stack trace in the comments on https://issues.redhat.com/browse/ENTMQCL-1699. I was not able to get a core file this time, when reproducing the issue in Docker. No core file is created, and coredumpctl does not report any cores.

Comment 1 Jiri Danek 2021-06-15 15:38:50 UTC
When searching Bugzilla, I found two similar issues. Neither seems to be a duplicate, and neither gave me any hints.

https://bugzilla.redhat.com/show_bug.cgi?id=767094
https://bugzilla.redhat.com/show_bug.cgi?id=104173

Comment 2 Miro Hrončok 2021-06-15 20:29:51 UTC
For the record, I see this in Fedora Rawhide container as well with Python 3.6 as well as Python 3.9 but not with Python 3.10.

Comment 3 Miro Hrončok 2021-06-15 20:33:37 UTC
https://bugs.python.org/issue18748 might be relevant

Comment 4 Miro Hrončok 2021-06-15 20:40:56 UTC
(In reply to Miro Hrončok from comment #2)
> For the record, I see this in Fedora Rawhide container as well with Python
> 3.6 as well as Python 3.9 but not with Python 3.10.

Actually, I was piping the output to `more` and when I don't do that, I cannot reproduce this with Python 3.9.

I can reproduce this on Fedora Rawhide with Python 3.6 and 3.7, but not in 3.8+.

That kinda supports the idea that this was fixed via https://bugs.python.org/issue18748

Comment 5 Victor Stinner 2021-06-16 15:05:26 UTC
> This issue manifests both on RHEL 7 and RHEL 8 with Python version 3.6. It does not manifest with Python version 2.7 (on RHEL 7).

Oh right, in Python 2.7, _thread.start_new_thread() doesn't call pthread_cancel() at the thread exit. It does in Python 3.6.

The pthread_cancel() call is redundant and can be removed. Removing the call fixes this race condition.

I proposed exactly that in Python upstream:

* https://bugs.python.org/issue44434
* https://github.com/python/cpython/pull/26758


"How reproducible: Intermittently, but happens fairly frequently using the attached reproducer. About 10 attempts at running the reproduction steps should be sufficient to reproduce."

Right, even if the file descriptor limit is very low (5), it remains hard to trigger the issue with 1000 threads. The race condition is hard to trigger. I attached 2 different reproducer scripts to https://bugs.python.org/issue44434 which make the race condition more likely.

It seems like sometimes the libgcc_s library is loaded early during Python startup. Sometimes, it is only loaded when the first thread exits. Sometimes, it goes fine. Sometimes, I get the abort() call with the error message.
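
One way to observe at which point the library gets loaded, on Linux only, is to look for it in the process memory map. The sketch below assumes the usual six-column layout of /proc/self/maps; both function names are hypothetical.

```python
def libgcc_is_mapped(maps_text):
    """Return True if the given /proc/<pid>/maps text mentions libgcc_s.

    Each mapping line is expected to look like:
    address perms offset dev inode pathname
    Anonymous mappings have no pathname and are skipped.
    """
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) == 6 and "libgcc_s.so" in fields[5]:
            return True
    return False


def libgcc_loaded_in_this_process():
    """Check the current process; returns None if the map cannot be read."""
    try:
        with open("/proc/self/maps") as maps_file:
            return libgcc_is_mapped(maps_file.read())
    except OSError:
        # Not Linux, or no file descriptor left to open the file.
        return None
```

Note that reading /proc/self/maps itself needs a free file descriptor, so under a very low limit the check can fail and return None.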


"Workaround: export LD_PRELOAD=/usr/lib64/libgcc_s.so.1"

Another workaround is to use a larger file descriptor limit, but that only makes the race condition less likely; it doesn't fully fix it.
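
For illustration, a program could raise its own soft limit at startup with the standard `resource` module. This is a sketch only: `raise_nofile_limit` and the default of 1024 are hypothetical, and as noted above a larger limit reduces the likelihood of the race without removing it.

```python
import resource


def raise_nofile_limit(minimum=1024):
    """Raise the soft RLIMIT_NOFILE towards `minimum`, capped by the hard limit.

    Hypothetical helper: returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Nothing to do if the soft limit is unlimited or already high enough.
    if soft == resource.RLIM_INFINITY or soft >= minimum:
        return soft
    if hard == resource.RLIM_INFINITY:
        target = minimum
    else:
        target = min(minimum, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # The platform refused the change; keep the current limit.
        return soft
    return target
```

With the `--nofile=5:5` setting from the reproduction steps, the hard limit is also 5, so an unprivileged process cannot raise it this way; the limit would have to be changed where it is set.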

Comment 6 Victor Stinner 2021-06-21 12:32:25 UTC
Ok, the issue is now fixed in Python upstream in 3.9, 3.10 and main branches: https://bugs.python.org/issue44434

> This issue manifests both on RHEL 7 and RHEL 8 with Python version 3.6. It does not manifest with Python version 2.7 (on RHEL 7).

Jiri Danek: Do you need a backport to Python 3.6 of RHEL7 and RHEL8, or is the "export LD_PRELOAD=/usr/lib64/libgcc_s.so.1" workaround acceptable for your use case?

Comment 7 Jiri Danek 2021-06-22 14:53:40 UTC
> Jiri Danek: Do you need a backport to Python 3.6 of RHEL7 and RHEL8 [...]?

TBH, I don't know. We only hit this issue during testing; it does not have an associated customer case. For testing the EMFILE error handling in the Qpid Proton Python library, I feel that the `export LD_PRELOAD=/usr/lib64/libgcc_s.so.1` workaround is perfectly satisfactory, now that we understand what's actually happening. Whether there is sufficient value in fixing the CPython interpreter itself, I can't tell. Proton in general was never all that good at handling resource exhaustion cases and, given that prior experience, no one really expects it to excel in this area. In other words, this sort of resiliency is not a crucial feature of the product. I will ask around the team and update here.

Comment 8 Jiri Danek 2021-06-30 12:29:23 UTC
We discussed this at the AMQ Clients project meeting. We think this issue should be fixed as part of the regular RHEL bugfix errata, since it potentially affects all Python 3.6 users.

Comment 16 errata-xmlrpc 2021-11-09 19:39:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: python3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4399