Description of problem:
When attempting to build python-sqlalchemy-1.4.43 on Rawhide (f38), builds fail randomly between architectures because xdist worker processes crash.

Version-Release number of selected component (if applicable):
python3-pytest-7.1.3-1.fc38
python3-pytest-xdist-3.0.2-2.fc38

How reproducible:
9 times out of 9 ;)

Steps to Reproduce:
1. koji build --scratch f38 'git+https://src.fedoraproject.org/rpms/python-sqlalchemy.git#03b848d2a7d9a27da65451e5d34af70396fb0f7c'

Actual results:
One or more arch builds fail because xdist workers crash in the tests. The set of failing arches varies between attempted builds; no single arch fails every time, but x86_64 fails most of the time.

Expected results:
Arch builds succeed (like the same NVR on f35-f37 and epel9, or the one build on f38 in which I disabled parallel tests).

Additional info:
A couple of attempted builds and their failing architectures:
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93952008 (i686, x86_64, s390x)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93952430 (i686, x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93951527 (i686, x86_64, ppc64le)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93950665 (x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93949460 (i686, x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93948951 (i686, s390x)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93950092 (i686, x86_64, aarch64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93948565 (x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93948123 (x86_64)
- https://koji.fedoraproject.org/koji/taskinfo?taskID=93969488 (i686, a scratch build)

The successful build on f38 with parallel tests disabled: https://koji.fedoraproject.org/koji/buildinfo?buildID=2086144
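For anyone who wants to try this outside Koji, a rough local equivalent would be something like the following (the mock chroot name and the resulting SRPM filename are guesses on my part, adjust as needed):

    fedpkg clone --anonymous python-sqlalchemy && cd python-sqlalchemy
    git checkout 03b848d2a7d9a27da65451e5d34af70396fb0f7c
    fedpkg --release f38 srpm
    mock -r fedora-rawhide-x86_64 --rebuild python-sqlalchemy-*.src.rpm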
It seems somewhat unlikely to me that this is a pytest-xdist bug, given that pytest-xdist is pure Python. Does this reproduce locally in mock? Are you able to get a core dump?
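(If it does reproduce in mock, grabbing a core should be straightforward. Since the chroot shares the host kernel, I'd expect systemd-coredump on the host to catch the worker segfaults; the commands below are just an illustration of that route.)

    coredumpctl list python3    # find the crashed xdist worker
    coredumpctl gdb             # open the most recent core in gdb
    (gdb) bt                    # native backtrace of the crashing thread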
(In reply to Scott Talbert from comment #1)
> It seems somewhat unlikely to me that this is a pytest-xdist bug, given that
> pytest-xdist is pure Python.

Something related to how pytest-xdist runs the tests seems to trigger this. When I run the tests without `-n auto`, I can’t reproduce the issue.

> Does this reproduce locally in mock? Are you able to get a core dump?

Hmm, didn’t notice that the processes in question actually segfaulted… I just did a local mock build on x86_64, but it failed to reproduce the problem here.

@mbayer, SQLAlchemy has some small C extensions but I don’t see how they could be the culprit here – pytest-xdist runs separate processes for its workers… Did you get any similar reports upstream?
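For reference, the difference between the two runs boils down to roughly this (the actual %check invocation is simplified here, so treat the paths and options as an approximation):

    python3 -m pytest -n auto    # parallel run via xdist workers, crashes intermittently in koji
    python3 -m pytest            # serial run, no crashes that I could reproduce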
(In reply to Nils Philippsen from comment #2)
> (In reply to Scott Talbert from comment #1)
> > It seems somewhat unlikely to me that this is a pytest-xdist bug, given that
> > pytest-xdist is pure Python.
>
> Something related to how pytest-xdist runs the tests seems to trigger this.
> When I run the tests without `-n auto`, I can’t reproduce the issue.
>
> > Does this reproduce locally in mock? Are you able to get a core dump?
>
> Hmm, didn’t notice that the processes in question actually segfaulted… I
> just did a local mock build on x86_64, but it failed to reproduce the
> problem here.
>
> @mbayer, SQLAlchemy has some small C extensions but I don’t see how they
> could be the culprit here – pytest-xdist runs separate processes for its
> workers… Did you get any similar reports upstream?

Yep. It looks like the segfaults are happening during garbage collection, so my suspicion would be in some of the C extension cleanup code. I wonder if releng could get us access to some of the core dumps?
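If we do get a core, something along these lines should show whether the crash really is in the C extension teardown during a GC pass (assuming the matching python3 debuginfo is available so gdb's Python helpers like py-bt work; the core filename below is just a placeholder):

    gdb /usr/bin/python3.11 core.12345
    (gdb) bt       # native frames: SQLAlchemy C extension code vs. CPython's gc?
    (gdb) py-bt    # Python-level frames at the point of the crash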
This issue has not been seen before, and we do run with py3.11 in our test suite.

One possible culprit is the old version of greenlet in use, which was never built for Python 3.11 and has a significant py3.11 memory leak prior to 2.0.1:

    python3-greenlet  x86_64  1.1.3-1.fc38  build  118 k

Greenlet has an extreme memory leak under Python 3.11 only, unless you update to 2.0.1, just released this week. However, that memory leak occurs when you make lots of greenlets, and the test run here is likely making minimal use of greenlets as there are no async DB drivers in the run. I can confirm it will run the test suite itself inside of a single greenlet, but we've observed no issues with that kind of thing.

So one thing we can try is to disable asyncio entirely. There's a "--disable-asyncio" flag for the test run, but it appears to be non-working at the moment, and even if I fix the small problem this parameter has, it does not eliminate all greenlet use. The way to guarantee nothing with greenlet happens is to not have greenlet installed in the environment at all; the test suite for 1.4.x should be able to run in its entirety without greenlet installed. Or, you can try to get greenlet 2.0.1 installed in the environment, which has fixed the memory leak issue.

Beyond that, we would need to identify what has changed for this build:

1. Was the issue observed with SQLAlchemy 1.4.42? I assume not.
   a. Was the Python version 3.11.0 the same?
   b. Was the greenlet version 1.1.3 the same?
   c. Was the version of pytest and pytest-xdist the same?
   d. Was the version of the sqlite3 / sqlite3-devel native libraries the same?

Overall, nothing much has changed in SQLAlchemy 1.4.43 vs. 1.4.42, and certainly nothing in the area of the C extensions, so we would need to look at other factors which have changed that introduced this problem.
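As a rough sketch of the two workarounds above (package names, paths and the exact pytest invocation are assumptions about your build setup, not something I've verified against the Fedora spec):

    # confirm which greenlet (if any) the build environment actually sees
    python3 -c "import greenlet; print(greenlet.__version__)"

    # option 1: run the 1.4.x suite in an environment with no greenlet at all
    python3 -m venv /tmp/sa-test && /tmp/sa-test/bin/pip install . pytest pytest-xdist
    /tmp/sa-test/bin/python -m pytest -n auto

    # option 2: pull greenlet >= 2.0.1 into the buildroot, which fixes the py3.11 leak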
A patch that will repair the --disable-asyncio parameter, if you want to include it, is at https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/4191
Note also that py3.11 has its own pretty serious issues with concurrency. I doubt xdist is spinning up lots of threads, but on the py3.11 side I've also identified this leak: https://github.com/python/cpython/issues/99205 ; that's in all py3.11 versions.
This bug appears to have been reported against 'rawhide' during the Fedora Linux 38 development cycle. Changing version to 38.
Closing this as it doesn't seem to be a bug in pytest-xdist.