Bug 1561170

Summary: Segfault while creating thread state
Product: Red Hat Enterprise Linux 7 Reporter: Lon Hohberger <lhh>
Component: python-cffiAssignee: Christian Heimes <cheimes>
Status: CLOSED WONTFIX QA Contact: ipa-qe <ipa-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.5CC: apevec, cheimes, cstratak, hhorak, jjoyce, jschluet, lhh, mkosek, nsoffer, pviktori, python-maint, salmy, vstinner
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-03-14 15:27:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
CFFI thread test case 3 (tarball w/ code & makefile) none

Description Lon Hohberger 2018-03-27 19:51:12 UTC
Description of problem:

The python-cffi package 1.11.2 has a test which spawns 20 threads at once, doing a bunch of adds. Nothing too serious.

The problem seems to be that, when spawning 20 threads at once, Python itself crashes:

#0  new_threadstate (interp=0x0, init=init@entry=1) at /usr/src/debug/Python-2.7.5/Python/pystate.c:202
#1  0x00007ffff78f2c1a in PyThreadState_New (interp=<optimized out>) at /usr/src/debug/Python-2.7.5/Python/pystate.c:213
#2  0x00007ffff78f347c in PyGILState_Ensure () at /usr/src/debug/Python-2.7.5/Python/pystate.c:600
#3  0x0000000000401b8b in _cffi_initialize_python () at _add3_cffi.c:815
#4  0x0000000000402035 in _cffi_start_python () at _add3_cffi.c:1098
#5  0x00000000004020a9 in _cffi_start_and_call_python (externpy=0x604200 <_cffi_externpy__add3>, 
    args=0x7ffff651aeb0 "\350\003") at _add3_cffi.c:1138
#6  0x00000000004021a0 in add3 (a0=1000, a1=200, a2=30, a3=4) at _add3_cffi.c:1206
#7  0x000000000040174c in start_routine_3 (arg=0x0) at thread3-test.c:26
#8  0x00007ffff7bc6dd5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007ffff7524b3d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable): python-2.7.5-68.el7

How reproducible: 50%

Steps to Reproduce:
1. Run checks for python-cffi-1.11.2

Actual results: Segmentation fault.

Additional info:

Inspecting the stack is peculiar:

At frame 0:

interp = NULL, causing segfault on line 202 in pystate.c

At frame 1: 

#1  0x00007ffff78f2c1a in PyThreadState_New (interp=<optimized out>) at /usr/src/debug/Python-2.7.5/Python/pystate.c:213
213	    return new_threadstate(interp, 1);

At frame 2:

#2  0x00007ffff78f347c in PyGILState_Ensure () at /usr/src/debug/Python-2.7.5/Python/pystate.c:600
600	        tcur = PyThreadState_New(autoInterpreterState);
(gdb) p autoInterpreterState
$16 = (PyInterpreterState *) 0x7ffff00008f0


It looks like something's wiping it out on the stack (compiler optimizing too much?).  Placing explicit thread locks around bits to serialize calls into PyGILState_Ensure() to ensure threads didn't enter that code simultaneously didn't seem to help at all.

The same code works correctly on F26 with 2.7.14-7.fc26.

I'm willing to patch up python-cffi as a workaround, but the code isn't terribly familiar yet.

Comment 2 Lon Hohberger 2018-03-27 20:07:00 UTC
Note: Downgrading the mock environment from python-2.7.5-68 to 2.7.5-58 causes everything to work correctly.

Comment 9 Lon Hohberger 2018-03-29 19:38:40 UTC
Created attachment 1414884 [details]
CFFI thread test case 3 (tarball w/ code & makefile)

This is one of the failing tests.  Downgrading to python-2.7.5-58.el7 (or running this on F26/F27 builds) works.

Comment 11 Petr Viktorin (pviktori) 2018-04-11 10:12:15 UTC
Lon, did you get a moment to test the builds?

Comment 17 Lon Hohberger 2018-05-02 18:23:50 UTC
Also my initial comment said that it worked on F26 with python2-2.7.14-7.fc26; that's inaccurate, it crashes there.

On F28, the code also fails with python2-2.7.14-15.fc28.

Comment 18 Charalampos Stratakis 2018-05-03 10:06:24 UTC
(In reply to Lon Hohberger from comment #17)
> Also my initial comment said that it worked on F26 with
> python2-2.7.14-7.fc26; that's inaccurate, it crashes there.
> 
> On F28, the code also fails with python2-2.7.14-15.fc28.

It will fail there as well, since the fix included in the RHEL build is also included in Fedora.

cc'ing the author of the patch.

Comment 19 Charalampos Stratakis 2018-05-03 10:43:41 UTC
Also here is the commit causing the issue: https://github.com/python/cpython/commit/830daae1c82ed33deef0086b7b6323e5be0b0cc8

Comment 20 Nir Soffer 2018-05-03 11:22:03 UTC
(In reply to Charalampos Stratakis from comment #19)
> Also here is the commit causing the issue:
> https://github.com/python/cpython/commit/
> 830daae1c82ed33deef0086b7b6323e5be0b0cc8

Hi Charalampos, why do you think that releasing the GIL in fileio is related to 
cffi?

Comment 21 Charalampos Stratakis 2018-05-03 11:27:15 UTC
(In reply to Nir Soffer from comment #20)
> (In reply to Charalampos Stratakis from comment #19)
> > Also here is the commit causing the issue:
> > https://github.com/python/cpython/commit/
> > 830daae1c82ed33deef0086b7b6323e5be0b0cc8
> 
> Hi Charalampos, why do you think that releasing the GIL in fileio is related
> to 
> cffi?

I haven't dived into the code of this patch, however from testing of the aforementioned builds, it suggests that the test fails with that patch applied.

Comment 22 Nir Soffer 2018-05-03 11:41:35 UTC
I looked briefly in the attached test case, it is very complex code, so there is 
good chance that the test code is doing something wrong, and the fix for fileio
just expose the issue.

- Is this upstream test from cffi package?
- Does this happen with upstream python?

Comment 23 Lon Hohberger 2018-05-29 19:41:20 UTC
It's the test code lifted directly from cffi 1.11.2 test runs.

The code is somewhat complicated since it's generated code.

I did write the Makefile, however, to make it easier to run (e.g. so one doesn't have to build all of CFFI).

It also fails in the same location against Python 2.7.15, built from source. So, my expectation is that portions of CFFI will no longer work with 2.7.15:

Thread 3 "thread3-test" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff68c4700 (LWP 18472)]
new_threadstate (init=1, interp=0x0) at Python/pystate.c:202
202	        tstate->next = interp->tstate_head;
(gdb) bt
#0  new_threadstate (init=1, interp=0x0) at Python/pystate.c:202
#1  PyThreadState_New (interp=0x0) at Python/pystate.c:213
#2  PyGILState_Ensure () at Python/pystate.c:604
#3  0x0000000000406336 in _cffi_initialize_python ()
#4  0x00000000004067e3 in _cffi_start_python ()
#5  0x0000000000406857 in _cffi_start_and_call_python ()
#6  0x0000000000406952 in add3 ()
#7  0x0000000000406b34 in start_routine_3 ()
#8  0x00007ffff7bbf36d in start_thread () from /lib64/libpthread.so.0
#9  0x00007ffff71d6b4f in clone () from /lib64/libc.so.6
(gdb) up
#1  PyThreadState_New (interp=0x0) at Python/pystate.c:213
213	    return new_threadstate(interp, 1);
(gdb) list
208	}
209	
210	PyThreadState *
211	PyThreadState_New(PyInterpreterState *interp)
212	{
213	    return new_threadstate(interp, 1);
214	}
215	
216	PyThreadState *
217	_PyThreadState_Prealloc(PyInterpreterState *interp)
(gdb) p interp
$1 = (PyInterpreterState *) 0x0
(gdb) up
#2  PyGILState_Ensure () at Python/pystate.c:604
604	        tcur = PyThreadState_New(autoInterpreterState);
(gdb) p autoInterpreterState
$2 = (PyInterpreterState *) 0x7ffff00008f0

Comment 24 Charalampos Stratakis 2018-06-06 11:53:45 UTC
(In reply to Lon Hohberger from comment #23)
> It's the test code lifted directly from cffi 1.11.2 test runs.
> 
> The code is somewhat complicated since it's generated code.
> 
> I did write the Makefile, however, to make it easier to run (e.g. so one
> doesn't have to build all of CFFI).
> 
> It also fails in the same location against Python 2.7.15, built from source.
> So, my expectation is that portions of CFFI will no longer work with 2.7.15:
> 
> Thread 3 "thread3-test" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff68c4700 (LWP 18472)]
> new_threadstate (init=1, interp=0x0) at Python/pystate.c:202
> 202	        tstate->next = interp->tstate_head;
> (gdb) bt
> #0  new_threadstate (init=1, interp=0x0) at Python/pystate.c:202
> #1  PyThreadState_New (interp=0x0) at Python/pystate.c:213
> #2  PyGILState_Ensure () at Python/pystate.c:604
> #3  0x0000000000406336 in _cffi_initialize_python ()
> #4  0x00000000004067e3 in _cffi_start_python ()
> #5  0x0000000000406857 in _cffi_start_and_call_python ()
> #6  0x0000000000406952 in add3 ()
> #7  0x0000000000406b34 in start_routine_3 ()
> #8  0x00007ffff7bbf36d in start_thread () from /lib64/libpthread.so.0
> #9  0x00007ffff71d6b4f in clone () from /lib64/libc.so.6
> (gdb) up
> #1  PyThreadState_New (interp=0x0) at Python/pystate.c:213
> 213	    return new_threadstate(interp, 1);
> (gdb) list
> 208	}
> 209	
> 210	PyThreadState *
> 211	PyThreadState_New(PyInterpreterState *interp)
> 212	{
> 213	    return new_threadstate(interp, 1);
> 214	}
> 215	
> 216	PyThreadState *
> 217	_PyThreadState_Prealloc(PyInterpreterState *interp)
> (gdb) p interp
> $1 = (PyInterpreterState *) 0x0
> (gdb) up
> #2  PyGILState_Ensure () at Python/pystate.c:604
> 604	        tcur = PyThreadState_New(autoInterpreterState);
> (gdb) p autoInterpreterState
> $2 = (PyInterpreterState *) 0x7ffff00008f0

Ideally this should be resolved upstream first. Could you create an issue at bugs.python.org ?

Comment 25 Petr Viktorin (pviktori) 2018-06-07 14:27:55 UTC
Victor, if you get some time, could you investigate this one?

Comment 26 Victor Stinner 2018-06-13 16:12:12 UTC
I'm not sure that the bug is related, but Python 2.7.15 contains the following fix:
* "PyGILState_Ensure on non-Python thread causes fatal error"
* https://bugs.python.org/issue20891
* https://github.com/python/cpython/commit/be6b74c0795b709c7a04e2187a7e32d08f5155f6

Comment 27 Victor Stinner 2018-06-15 10:16:02 UTC
Ok, I'm able to reproduce the bug on Fedora 28 using attached "CFFI thread test case 3".

The bug is a race condition in _cffi_initialize_python() of the test case. The first thread calls Py_InitializeEx(), but then other threads call PyGILState_Ensure(). If PyGILState_Ensure() is called before Py_InitializeEx() completes, you get the crash.

Example of crash. Thread A is still running Py_InitializeEx() while Thread B calls PyGILState_Ensure().

Thread A:

#0  0x00007ffff77d82f0 in PyDict_GetItem (op=0x7ffff7f74060, key=0x7ffff7f65ee0) at Objects/dictobject.c:714
#1  0x00007ffff77dd5fe in PyDict_GetItemString (v=0x7ffff7f74060, key=0x7ffff790de80 "__doc__") at Objects/dictobject.c:2509
#2  0x00007ffff7819276 in PyType_Ready (type=0x7ffff7b5a8e0 <_PyExc_LookupError>) at Objects/typeobject.c:4260
#3  0x00007ffff77a89b5 in _PyExc_Init () at Objects/exceptions.c:2078
#4  0x00007ffff78bc60e in Py_InitializeEx (install_sigs=0) at Python/pythonrun.c:265
#5  0x000000000040182c in _cffi_py_initialize () at _add2_cffi.c:782
#6  0x0000000000401851 in _cffi_initialize_python () at _add2_cffi.c:805
#7  0x0000000000401e6a in _cffi_start_python () at _add2_cffi.c:1103
#8  0x0000000000401ede in _cffi_start_and_call_python (externpy=0x604200 <_cffi_externpy__add2>, args=0x7ffff6bceea0 "(") at _add2_cffi.c:1143
#9  0x0000000000401fc9 in add2 (a0=40, a1=2, a2=100) at _add2_cffi.c:1209
#10 0x0000000000402c15 in start_routine_2 (arg=0x0) at thread3-test.c:14
#11 0x00007ffff7bbf594 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffff746500f in clone () from /lib64/libc.so.6

Thread B:

#0  0x00007ffff73a1f2b in raise () from /lib64/libc.so.6
#1  0x00007ffff738c561 in abort () from /lib64/libc.so.6
#2  0x00007ffff738c431 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007ffff739a692 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ffff78bbe7b in PyGILState_Ensure () at Python/pystate.c:600
#5  0x000000000040233b in _cffi_initialize_python () at _add3_cffi.c:815
#6  0x0000000000402940 in _cffi_start_python () at _add3_cffi.c:1098
#7  0x00000000004029b4 in _cffi_start_and_call_python (externpy=0x604280 <_cffi_externpy__add3>, args=0x7ffff63cde90 "\350\003") at _add3_cffi.c:1138
#8  0x0000000000402aaf in add3 (a0=1000, a1=200, a2=30, a3=4) at _add3_cffi.c:1206
#9  0x0000000000402c91 in start_routine_3 (arg=0x0) at thread3-test.c:26
#10 0x00007ffff7bbf594 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff746500f in clone () from /lib64/libc.so.6

Code of _cffi_carefully_make_gil():

https://bitbucket.org/cffi/cffi/src/af3a1e8a51e01affd0542727bb9e90fc3b90dafb/cffi/_embedding.h?at=default&fileviewer=file-view-default#_embedding.h-248

Comment 28 Victor Stinner 2018-06-15 10:35:30 UTC
It seems like the bug has been fixed in cffi 1.11.5 by the following commit:
https://bitbucket.org/cffi/cffi/commits/7b46e415e208

I reassign the bug to python-cffi.

Comment 29 Christian Heimes 2018-06-15 10:51:48 UTC
Upstream bug report: https://bitbucket.org/cffi/cffi/issues/358/

Comment 33 Christian Heimes 2019-02-27 10:54:51 UTC
I'm considering to close this BZ as WONTFIX. The crash is just a test artefact. Supported production cases are not affected by the issue. RHEL 8 contains a newer version of CFFI that does not observe the crash, too.

I don't want to risk updating CFFI just to silence a test problem.

Comment 34 Martin Kosek 2019-03-14 15:27:34 UTC
As I saw no concerns with Christian's plan and I agree with him, I am closing the bug.