Bug 627347

Summary: segfault when running python-jinja2 test suite on big-endian 64-bit architectures
Product: [Fedora] Fedora Reporter: Dan Horák <dan>
Component: pythonAssignee: Dave Malcolm <dmalcolm>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 14CC: armin.ronacher, dmalcolm, ivazqueznet, james.antill, jonathansteffan
Target Milestone: ---   
Target Release: ---   
Hardware: s390x   
OS: Linux   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 627398 (view as bug list) Environment:
Last Closed: 2012-08-16 14:02:03 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:    
Bug Blocks: 467765, 627398    
Description Flags
backtrace none

Description Dan Horák 2010-08-25 14:45:58 EDT
Python segfaults when running python-jinja2 test suite - see http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=198496 for the logs.

Version-Release number of selected component (if applicable):

PS: I don't expect a lot of activity here as it's s390x specific, but I should be able to provide an environment to debug this issue. The main purpose of this bug is to track the issue.
Comment 1 Dave Malcolm 2010-08-25 15:32:29 EDT
Thanks for filing this; are you able to provide a backtrace? (with debuginfo installed please!)

Note that:
is a convenient way of invoking arbitrary python scripts under gdb
Comment 2 Dan Horák 2010-08-25 16:11:41 EDT
Created attachment 441039 [details]

backtrace generated with "gdb --args python ./setup.py test"
Comment 3 Dave Malcolm 2010-08-25 16:26:11 EDT
In frame #6 it's claiming RAM for a new reference-owning object, and it happens that the run-every-700 (IIRC) allocations garbage collector is invoked.

In frame #1 the garbage collector is visiting a PyTracebackObject, and it looks like one of the fields of that object is invalid (a pointer has value 0xffffffffffffff)

I'd need access to the machine to be able to debug this further.
Comment 4 Dave Malcolm 2010-08-25 17:41:41 EDT
On the machine, I was able to reproduce it within gdb:
(gdb) up
#1  0x0000020000169372 in tb_traverse (tb=0x8087ef48, visit=0x20000174c2c <visit_decref>, arg=0x0) at /usr/src/debug/Python-2.7/Python/traceback.c:36
36          Py_VISIT(tb->tb_next);
(gdb) p tb
$1 = (PyTracebackObject *) 0x8087ef48
(gdb) p *tb
$2 = {ob_refcnt = 2, ob_type = 0x20000220ef0, tb_next = 0xffffffffffffff, tb_frame = 0xff0000008095dc48, tb_lasti = 147, tb_lineno = 459}

tb->tb_next and tb->tb_frame are both invalid
tb->ob_type is correct (tp_name is "traceback"), and indeed equals &PyTraceBack_Type
(gdb) p &PyTraceBack_Type
$6 = (PyTypeObject *) 0x20000220ef0

I attempted to use a hardware watchpoint to see how the fields were manipulated.

(gdb) watch ((PyTracebackObject *) 0x8087ef48)->tb_next

and tried to restrict the watchpoint to just when the memory is a traceback object:
(gdb) condition 1 ((PyTracebackObject*)0x8087ef48)->ob_type == 0x20000220ef0
(gdb) display *((PyTracebackObject*)0x8087ef48)

This identified the problem: jinja2 is using ctypes. Specifically,
attempts to use ctypes to directly manipulate C-level fields of the tracebacks, embedding numerous assumptions about the in-memory representation of CPython's tracebacks.  There are any number of ways that this can go wrong; it may be that it's getting the 32 vs 64 bit determination wrong, or perhaps something else.

The specific backtrace of where the tb_next field become corrupted is here (it's trying to directly manipulate the reference count of the object, but misses and hits the tb_next field):

#0  l_set (ptr=0x8087ef59, value=<value optimized out>, size=<value optimized out>) at /usr/src/debug/Python-2.7/Modules/_ctypes/cfield.c:813
#1  0x00000200068fa710 in PyCData_set (dst=<_Traceback at remote 0x808faa48>, type=<_ctypes.PyCSimpleType at remote 0x808a2f68>, setfunc=0, value=-1, 
    index=2, size=8, ptr=0x8087ef59 "\377\377\377\377\377\377\377\377") at /usr/src/debug/Python-2.7/Modules/_ctypes/_ctypes.c:2936
#2  0x00000200068ffd7a in PyCField_set (self=<value optimized out>, inst=<value optimized out>, value=<value optimized out>)
    at /usr/src/debug/Python-2.7/Modules/_ctypes/cfield.c:214
#3  0x00000200000cfd62 in PyObject_GenericSetAttr (obj=<_Traceback at remote 0x808faa48>, name='ob_refcnt', value=-1)
    at /usr/src/debug/Python-2.7/Objects/object.c:1492
#4  0x00000200000cf65e in PyObject_SetAttr (v=<_Traceback at remote 0x808faa48>, name='ob_refcnt', value=-1)
    at /usr/src/debug/Python-2.7/Objects/object.c:1245
#5  0x0000020000138ea4 in PyEval_EvalFrameEx (f=
    Frame 0x808dd2f8, for file /root/fedora/python-jinja2/Jinja2-2.5.2/jinja2/debug.py, line 289, in tb_set_next (tb=<traceback at remote 0x80907908>, next=None, obj=<_Traceback at remote 0x808fa948>, old=<_Traceback at remote 0x808faa48>), throwflag=<value optimized out>)
    at /usr/src/debug/Python-2.7/Python/ceval.c:2059
#6  0x000002000013f842 in PyEval_EvalCodeEx (co=0x80888030, globals=<value optimized out>, locals=<value optimized out>, args=<value optimized out>, 
    argcount=<value optimized out>, kws=0x80890758, kwcount=0, defs=0x0, defcount=0, closure=
    (<cell at remote 0x808835e8>, <cell at remote 0x80883eb8>, <cell at remote 0x808830a8>)) at /usr/src/debug/Python-2.7/Python/ceval.c:3311
#7  0x000002000013d4cc in fast_function (f=
    Frame 0x808905b8, for file /root/fedora/python-jinja2/Jinja2-2.5.2/jinja2/debug.py, line 36, in _set_tb_next (self=<TracebackFrameProxy(tb=<traceback at remote 0x80907908>) at remote 0x808f5400>, next=None), throwflag=<value optimized out>) at /usr/src/debug/Python-2.7/Python/ceval.c:4167
#8  call_function (f=
    Frame 0x808905b8, for file /root/fedora/python-jinja2/Jinja2-2.5.2/jinja2/debug.py, line 36, in _set_tb_next (self=<TracebackFrameProxy(tb=<traceback at remote 0x80907908>) at remote 0x808f5400>, next=None), throwflag=<value optimized out>) at /usr/src/debug/Python-2.7/Python/ceval.c:4092
#9  PyEval_EvalFrameEx (f=
    Frame 0x808905b8, for file /root/fedora/python-jinja2/Jinja2-2.5.2/jinja2/debug.py, line 36, in _set_tb_next (self=<TracebackFrameProxy(tb=<traceback at remote 0x80907908>) at remote 0x808f5400>, next=None), throwflag=<value optimized out>) at /usr/src/debug/Python-2.7/Python/ceval.c:2721
Comment 5 Dave Malcolm 2010-08-25 17:47:14 EDT
within _init_ugly_crap gives this output:
  [('_ob_next', <class 'jinja2.debug.LP__PyObject'>), ('_ob_prev', <class 'jinja2.debug.LP__PyObject'>), ('ob_refcnt', <class 'ctypes.c_long'>), ('ob_type', <class 'jinja2.debug.LP__PyObject'>)]

Jinja is mistakenly deciding that this is a debug build of python, whereas it's an optimized build, which doesn't have the _ob_next or _ob_prev fields.  

Hence when Jinja attempts to manipulate ob_refcnt, it pokes at addresses that are (sizeof(PyObject*) * 2) bytes higher up in RAM than they ought to be, corrupting the later fields.
Comment 6 Dave Malcolm 2010-08-25 17:50:26 EDT
Adding some more debug prints to jinja indicates:
  object.__basicsize__: 0
  ctypes.sizeof(_PyObject): 16

hence they're non-equal, and the conditional "if object.__basicsize__ != ctypes.sizeof(_PyObject):"  is executed, and it treats it erroneously as a
build of Python.
Comment 7 Dave Malcolm 2010-08-25 18:00:38 EDT
Not yet sure why object.__basicsize__ == 0, I'd expect it to be 16.

If I change that conditional to "if False:", then the jinja selftest suite runs to completion.
Comment 8 Dave Malcolm 2010-08-25 18:19:54 EDT
Looks like an int vs Py_ssize_t issue.

struct _typeobject declares tp_basicsize and tp_itemsize as Py_ssize_t:
  Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */

but type_members defines them as T_INT:
	{"__basicsize__", T_INT, offsetof(PyTypeObject,tp_basicsize),READONLY},
	{"__itemsize__", T_INT, offsetof(PyTypeObject, tp_itemsize), READONLY},

Hence when accessing "object.__basicsize__", PyMember_GetOne reads it as a T_INT, which gets it as 0 (incorrect).  Accessing it as Py_ssize_t reads it as 16 (correct)
(gdb) p *(Py_ssize_t*)addr
$9 = 16
(gdb) p *(int*)addr
$10 = 0

Looks like a Python bug (albeit seen in what I think is a dubious use of ctypes by Jinja): I believe access of a type's __basicsize__ __itemsize__ fields will return 0 on 64-bit bigendian machines.

Looking at upstream sources, seems to still affect typeobject.c in latest upstream 2.7 still uses T_INT; and object.h still uses Py_ssize_t for these values.

Will see if I can create a minimal reproducer.

In the meantime, see comment #7 for a way to patch Jinja to build
Comment 9 Dave Malcolm 2010-08-25 18:49:53 EDT
I've filed a bug about this:
and attached a patch there.
Comment 10 Armin Ronacher 2010-08-31 07:34:42 EDT
That is an awesome bug report guys.  Major props for diving into that.

Now obviously the reason why Jinja2 is doing that is to produce proper tracebacks for generated code because Python lacks an interface for that.

I think a proper solution for redhat would be to use the switch that opts-in the c implementation of that function (new in 2.5.2, previously that switch was --with-speedups which is still available but gives a deprecation warning):

python setup.py --with-debugsupport install

I am currently trying to figure out a more reliable way to detect debug builds that does not involve basicsize.  Once I have a proper solution I send you a patch.  In any way I will do a 2.5.3 release that fixes that.
Comment 11 Dave Malcolm 2010-08-31 13:19:09 EDT
(In reply to comment #10)
> I am currently trying to figure out a more reliable way to detect debug builds
> that does not involve basicsize.  Once I have a proper solution I send you a
> patch.  In any way I will do a 2.5.3 release that fixes that.

From Python/sysmodule.c:
    #ifdef Py_TRACE_REFS
        {"getobjects",      _Py_GetObjects, METH_VARARGS},

    hasattr(sys, 'getobjects')
will only be true on CPython builds that had Py_TRACE_REFS, which IIRC is the condition you're looking for (presence of _ob_next/_ob_prev ptrs).

(Obviously the better fix is for CPython to somehow expose more of the PyTracebackObject to .py code)
Comment 12 Fedora End Of Life 2012-08-16 14:02:08 EDT
This message is a notice that Fedora 14 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 14. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advanced warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 14 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: