Bug 2141244 - worker processes crashing randomly
Summary: worker processes crashing randomly
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: python-pytest-xdist
Version: 38
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Scott Talbert
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-11-09 09:42 UTC by Nils Philippsen
Modified: 2023-07-18 19:18 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-18 19:18:11 UTC
Type: Bug
Embargoed:



Description Nils Philippsen 2022-11-09 09:42:59 UTC
Description of problem:

When attempting to build python-sqlalchemy-1.4.43 on Rawhide (f38), builds fail randomly between architectures because xdist worker processes crash.

Version-Release number of selected component (if applicable):
python3-pytest-7.1.3-1.fc38
python3-pytest-xdist-3.0.2-2.fc38

How reproducible:
9 times out of 9 ;)


Steps to Reproduce:
1. koji build --scratch f38 'git+https://src.fedoraproject.org/rpms/python-sqlalchemy.git#03b848d2a7d9a27da65451e5d34af70396fb0f7c'

Actual results:
One or more arch builds fail because xdist workers crash during the tests. The set of failing arches varies between attempts; no single arch fails every time (though x86_64 fails most of the time).

Expected results:
All arch builds succeed (as they do for the same NVR on f35-f37 and epel9, or for the f38 build in which I disabled parallel tests).

Additional info:

A couple of attempted builds and their failing architectures:

- https://koji.fedoraproject.org/koji/taskinfo?taskID=93952008 (i686, x86_64, s390x)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93952430 (i686, x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93951527 (i686, x86_64, ppc64le)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93950665 (x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93949460 (i686, x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93948951 (i686, s390x)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93950092 (i686, x86_64, aarch64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93948565 (x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93948123 (x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93969488 (i686, a scratch build)

The successful build on f38 with parallel tests disabled: https://koji.fedoraproject.org/koji/buildinfo?buildID=2086144

Comment 1 Scott Talbert 2022-11-09 14:17:20 UTC
It seems somewhat unlikely to me that this is a pytest-xdist bug, given that pytest-xdist is pure Python.

Does this reproduce locally in mock?  Are you able to get a core dump?

Comment 2 Nils Philippsen 2022-11-09 17:28:26 UTC
(In reply to Scott Talbert from comment #1)
> It seems somewhat unlikely to me that this is a pytest-xdist bug, given that
> pytest-xdist is pure Python.

Something related to how pytest-xdist runs the tests seems to trigger this. When I run tests without `-n auto`, I can’t reproduce the issue.

> Does this reproduce locally in mock?  Are you able to get a core dump?

Hmm, didn’t notice that the processes in question actually segfaulted… I just did a local mock build on x86_64, but it failed to reproduce the problem here.
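For what it's worth, one way to get at least a Python-level traceback out of a segfaulting worker would be something like the following (a sketch only; recent pytest already enables its own faulthandler plugin by default, this just makes it explicit and also takes effect in each xdist worker process, since pytest_configure runs there too):

    # throwaway conftest.py, for debugging only (hypothetical, not part of the package):
    # ensure faulthandler is enabled in every process, including xdist workers,
    # so a SIGSEGV dumps the active Python frames to the real stderr before the worker dies.
    import faulthandler
    import sys

    def pytest_configure(config):
        # use the original stderr, not the capture replacement pytest installs
        faulthandler.enable(file=sys.__stderr__, all_threads=True)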

@mbayer, SQLAlchemy has some small C extensions but I don’t see how they could be the culprit here – pytest-xdist runs separate processes for its workers… Did you get any similar reports upstream?

Comment 3 Scott Talbert 2022-11-09 20:55:10 UTC
(In reply to Nils Philippsen from comment #2)
> (In reply to Scott Talbert from comment #1)
> > It seems somewhat unlikely to me that this is a pytest-xdist bug, given that
> > pytest-xdist is pure Python.
> 
> Something related to how pytest-xdist runs the tests seems to trigger this.
> When I run tests without `-n auto`, I can’t reproduce the issue.
> 
> > Does this reproduce locally in mock?  Are you able to get a core dump?
> 
> Hmm, didn’t notice that the processes in question actually segfaulted… I
> just did a local mock build on x86_64, but it failed to reproduce the
> problem here.
> 
> @mbayer, SQLAlchemy has some small C extensions but I don’t see
> how they could be the culprit here – pytest-xdist runs separate processes
> for its workers… Did you get any similar reports upstream?

Yep.  It looks like the segfaults are happening during garbage collection, so my suspicion would be in some of the C extension cleanup code.
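One way to try to narrow that down (a debugging sketch, not something the package ships; it assumes a temporary conftest.py and uses only stock gc and pytest hooks) is to turn off automatic collection and collect explicitly after each test, so a crash during collection lands right after the test whose garbage triggers it:

    # temporary conftest.py instrumentation (hypothetical, debugging only)
    import gc

    def pytest_sessionstart(session):
        # stop collections from happening at arbitrary allocation points
        gc.disable()

    def pytest_runtest_teardown(item, nextitem):
        # force a collection after every test; if a worker segfaults here,
        # the last nodeid printed points at the test whose objects triggered it
        print(f"gc after {item.nodeid}", flush=True)
        gc.collect()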

I wonder if releng could get us access to some of the core dumps?

Comment 4 Michael Bayer 2022-11-11 14:38:18 UTC
This issue has not been seen before, and we do run with py3.11 in our test suite.

One possible culprit is the old version of greenlet in use, which was never built for Python 3.11 and has a significant py3.11 memory leak prior to 2.0.1:

python3-greenlet           x86_64    1.1.3-1.fc38               build    118 k

Greenlet has an extreme memory leak, under Python 3.11 only, unless you update to 2.0.1, which was released just this week.

However, that memory leak occurs when you make lots of greenlets, and the test run here likely makes minimal use of greenlets, as there are no async DB drivers in the run. I can confirm it will run the test suite itself inside of a single greenlet, but we've observed no issues with that kind of thing.

So one thing we can try is to disable asyncio entirely. There's a "--disable-asyncio" flag for the test run, but it appears to be non-working at the moment, and even if I fix the small problem this parameter has, it does not eliminate all greenlet use. The way to guarantee that nothing involving greenlet happens is to not have greenlet installed in the environment at all; the test suite for 1.4.x should be able to run in its entirety without greenlet installed.

Or, you can try to get greenlet 2.0.1 installed in the environment, which fixes the memory leak issue.
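If going the greenlet-free route, a minimal guard along these lines could confirm the environment really is greenlet-free before the suite runs (a hypothetical, throwaway conftest.py addition for debugging; pytest_sessionstart and importlib.util.find_spec are standard API, the guard itself is mine):

    # throwaway conftest.py guard (hypothetical, debugging only):
    # abort the session early if greenlet is importable, so a "greenlet-free"
    # reproduction attempt can't silently pick up an installed greenlet.
    import importlib.util

    import pytest

    def pytest_sessionstart(session):
        if importlib.util.find_spec("greenlet") is not None:
            pytest.exit("greenlet is importable; this is not a greenlet-free environment")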

Beyond that, we would need to identify what has changed for this build:

1. was the issue observed with SQLAlchemy 1.4.42? I assume not
   a. was the Python version 3.11.0 the same?
   b. was the greenlet version 1.1.3 the same?
   c. was the version of pytest and pytest-xdist the same? 
   d. was the version of sqlite3 / sqlite3-devel native libraries the same?


Overall, not much has changed in SQLAlchemy 1.4.43 vs. 1.4.42, and certainly nothing in the area of the C extensions, so we would need to look at other factors that have changed and that introduced this problem.

Comment 5 Michael Bayer 2022-11-11 14:52:29 UTC
A patch that will repair the --disable-asyncio parameter, if you want to include it, is at https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/4191

Comment 6 Michael Bayer 2022-11-11 14:56:38 UTC
Note also that py3.11 has its own pretty serious issues with concurrency. I doubt xdist is spinning up lots of threads, but on the py3.11 side I've also identified this leak, which is present in all py3.11 versions: https://github.com/python/cpython/issues/99205

Comment 7 Ben Cotton 2023-02-07 14:58:35 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 38 development cycle.
Changing version to 38.

Comment 8 Scott Talbert 2023-07-18 19:18:11 UTC
Closing this as it doesn't seem to be a bug in pytest-xdist.

