Created attachment 1791284
threads.py

Description of problem:
***********************
Python crashes with a core dump when I run the attached reproducer program after setting a limit on file descriptors.

In practice, this issue was originally encountered as
https://issues.redhat.com/browse/ENTMQCL-1699
https://issues.redhat.com/browse/ENTMQCL-2787

This issue manifests on both RHEL 7 and RHEL 8 with Python version 3.6. It does not manifest with Python version 2.7 (on RHEL 7).

Steps to reproduce:
*******************
# bash
# prlimit --pid $$ --nofile=5:5
# python threads.py
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "loop.py", line 6, in run_in_thread
Exception: aaaaaaaaaa
libgcc_s.so.1 must be installed for pthread_cancel to work

Workaround:
***********
# export LD_PRELOAD=/usr/lib64/libgcc_s.so.1

Version-Release number of selected component (if applicable):
*************************************************************
latest
docker run --rm -it registry.access.redhat.com/ubi8/ubi
python36 3.6.8-2.module+el8.1.0+3334+5cb623d7 from ubi-8-appstream

How reproducible:
*****************
Intermittently, but it happens fairly frequently with the attached reproducer. About 10 attempts at running the reproduction steps should be sufficient to reproduce.

Stacktrace:
***********
There is a stacktrace in the comments for https://issues.redhat.com/browse/ENTMQCL-1699. I was not able to get a core file this time when reproducing the issue in docker. The core file is not created and coredumpctl does not report any cores.
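The attachment itself is not reproduced in this report. The following is only a minimal sketch of what such a reproducer might look like, based on what the traceback above shows (a `run_in_thread` target that raises `Exception("aaaaaaaaaa")`). The `spawn_threads` helper, the thread count, and the start-then-join pattern are assumptions, not the contents of the real threads.py.

```python
import threading


def run_in_thread():
    # Mirrors the traceback: the thread target raises immediately,
    # so every thread finishes right away and goes through thread exit.
    raise Exception("aaaaaaaaaa")


def spawn_threads(count):
    """Start `count` short-lived threads one after another; return how many ran."""
    started = 0
    for _ in range(count):
        thread = threading.Thread(target=run_in_thread)
        thread.start()
        thread.join()  # the exception stays inside the thread; join() returns normally
        started += 1
    return started


if __name__ == "__main__":
    # Hypothetical count; meant to be run after `prlimit --pid $$ --nofile=5:5`.
    spawn_threads(1000)
```

Each thread prints its own traceback to stderr, which matches the "Exception in thread Thread-N" output in the steps to reproduce.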
When searching Bugzilla, I found two similar issues. Neither seems to be a duplicate, and neither gave me any hints.
https://bugzilla.redhat.com/show_bug.cgi?id=767094
https://bugzilla.redhat.com/show_bug.cgi?id=104173
For the record, I see this in a Fedora Rawhide container too, with Python 3.6 as well as Python 3.9, but not with Python 3.10.
https://bugs.python.org/issue18748 might be relevant
(In reply to Miro Hrončok from comment #2)
> For the record, I see this in Fedora Rawhide container as well with Python
> 3.6 as well as Python 3.9 but not with Python 3.10.

Actually, I was piping the output to `more` and when I don't do that, I cannot reproduce this with Python 3.9.

I can reproduce this on Fedora Rawhide with Python 3.6 and 3.7, but not in 3.8+. That kinda supports the idea that this was fixed via https://bugs.python.org/issue18748
> This issue manifests both on RHEL 7 and RHEL 8 with Python version 3.6. It does not manifest with Python version 2.7 (on RHEL 7).

Oh right, in Python 2.7, _thread.start_new_thread() doesn't call pthread_cancel() at the thread exit. It does in Python 3.6. The pthread_cancel() call is redundant and can be removed. Removing the call fixes this race condition. I proposed exactly that in Python upstream:
* https://bugs.python.org/issue44434
* https://github.com/python/cpython/pull/26758

> How reproducible: Intermittently, but happens fairly frequently using the attached reproducer. About 10 attempts at running the reproduction steps should be sufficient to reproduce.

Right, even if the file descriptor limit is very low (5), it remains hard to trigger the issue with 1000 threads. The race condition is hard to trigger. I attached 2 different reproducer scripts to https://bugs.python.org/issue44434 which make the race condition more likely.

It seems like sometimes the libgcc_s library is loaded early during Python startup. Sometimes, it is only loaded when the first thread exits. Sometimes, it goes fine. Sometimes, I get the abort() call with the error message.

> Workaround: export LD_PRELOAD=/usr/lib64/libgcc_s.so.1

Another workaround is to use a larger file descriptor limit, but that only makes the race condition less likely; it doesn't fully fix it.
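To observe whether libgcc_s was already loaded at a given point, one option on Linux is to look for it in /proc/self/maps. This is a rough sketch, not something taken from the upstream reproducers; the helper names `libgcc_is_mapped` and `current_process_has_libgcc` are made up for illustration.

```python
def libgcc_is_mapped(maps_text):
    """Return True if any mapping in /proc/<pid>/maps text names libgcc_s."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # The pathname, when present, is the sixth whitespace-separated field.
        if len(fields) == 6 and "libgcc_s.so" in fields[5]:
            return True
    return False


def current_process_has_libgcc():
    # Linux-specific: /proc/self/maps lists the files mapped into this process.
    with open("/proc/self/maps") as maps_file:
        return libgcc_is_mapped(maps_file.read())
```

Note that reading /proc/self/maps itself needs a free file descriptor, so with a limit as low as 5 the check may fail with EMFILE.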
Ok, the issue is now fixed in Python upstream in the 3.9, 3.10 and main branches: https://bugs.python.org/issue44434

> This issue manifests both on RHEL 7 and RHEL 8 with Python version 3.6. It does not manifest with Python version 2.7 (on RHEL 7).

Jiri Danek: Do you need a backport to Python 3.6 of RHEL7 and RHEL8, or is the "export LD_PRELOAD=/usr/lib64/libgcc_s.so.1" workaround acceptable for your use case?
> Jiri Danek: Do you need a backport to Python 3.6 of RHEL7 and RHEL8 [...]?

TBH, I don't know. We only hit this issue during testing; it does not have an associated customer case.

For testing the EMFILE error handling in the Qpid Proton Python library, I feel that the `export LD_PRELOAD=/usr/lib64/libgcc_s.so.1` workaround is perfectly satisfactory, now that we understand what's actually happening.

Whether there is sufficient value in fixing the CPython interpreter itself, I can't tell. Proton in general was never all that good at handling resource exhaustion cases, and given that prior experience, no one really expects it to excel in this area. In other words, this sort of resiliency is not a crucial feature of the product.

I will ask around the team and update here.
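For a test suite that launches Python subprocesses, the LD_PRELOAD workaround could be applied per child process instead of via a shell export. A minimal sketch under that assumption follows; `env_with_preload`, `run_test_script` and `script_path` are hypothetical names, and the library path is the one from the workaround in this report.

```python
import os
import subprocess
import sys

LIBGCC_PATH = "/usr/lib64/libgcc_s.so.1"  # path taken from the workaround in this report


def env_with_preload(base_env, library=LIBGCC_PATH):
    """Return a copy of base_env with `library` prepended to LD_PRELOAD."""
    env = dict(base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon-separated list in LD_PRELOAD.
    env["LD_PRELOAD"] = library if not existing else library + ":" + existing
    return env


def run_test_script(script_path):
    # script_path is a hypothetical test entry point run with the current interpreter.
    return subprocess.run([sys.executable, script_path], env=env_with_preload(os.environ))
```

Building a new dictionary leaves the parent environment untouched, so only the launched child preloads the library.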
We discussed this at the AMQ Clients project meeting. We think this issue should be fixed as part of the regular RHEL bugfix errata, since it potentially affects all Python 3.6 users.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: python3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4399