Bug 698726 - Crash on ppc64 within "--with-tsc" support (seen in python-debug during build)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: python
Version: 16
Hardware: powerpc
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dave Malcolm
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: F16Betappc
Reported: 2011-04-21 15:41 UTC by Karsten Hopp
Modified: 2012-03-14 13:10 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-27 15:55:55 UTC
Type: ---
Embargoed:


Attachments
gdb backtrace (22.96 KB, text/plain)
2011-04-21 15:41 UTC, Karsten Hopp
Candidate patch to fix --with-tsc on ppc64, and to fix aliasing violations on 32-bit ppc (1.72 KB, patch)
2011-08-23 19:51 UTC, Dave Malcolm


Links
System ID Private Priority Status Summary Last Updated
Python 12872 0 None None None Never

Description Karsten Hopp 2011-04-21 15:41:49 UTC
Created attachment 493900
gdb backtrace

Description of problem:

gcc -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv   -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv    -Xlinker -export-dynamic -o python-debug \
		Modules/python.o \
		-L. -lpython2.7_d -lpthread -ldl  -lutil   -lm  
/bin/sh: line 1: 21015 Segmentation fault      (core dumped) LD_LIBRARY_PATH=/builddir/build/BUILD/Python-2.7.1/build/debug: CC='gcc -pthread' LDSHARED='gcc -pthread -shared ' OPT='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv' ./python-debug -E /builddir/build/BUILD/Python-2.7.1/setup.py -q build
RPM build errors:


Version-Release number of selected component (if applicable):
python-2.7.1-6.fc15

Comment 1 Karsten Hopp 2011-04-21 15:42:38 UTC
full logs at https://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=197370

Comment 2 Karsten Hopp 2011-07-05 15:50:46 UTC
A similar problem still exists in python-2.7.2-4.fc16 on ppc64:

/builddir/build/BUILD/Python-2.7.2/Modules/posixmodule.c:7317: warning: the use of `tempnam' is dangerous, better use `mkstemp'
gcc -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -Wl,-z,relro -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv   -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -Wl,-z,relro -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -Wl,-z,relro -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv    -Xlinker -export-dynamic -o python-debug \
		Modules/python.o \
		-L. -lpython2.7_d -lpthread -ldl  -lutil   -lm  
/bin/sh: line 1:  3599 Segmentation fault      (core dumped) LD_LIBRARY_PATH=/builddir/build/BUILD/Python-2.7.2/build/debug: CC='gcc -pthread' LDSHARED='gcc -pthread -shared ' OPT='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -Wl,-z,relro -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv' ./python-debug -E /builddir/build/BUILD/Python-2.7.2/setup.py -q build
make: *** [sharedmods] Error 139

full logs at http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=245909

@Jakub:
The last successful python build was python-2.7.1-4.fc15; python-2.7.1-5.fc15 already failed with this python-debug segmentation fault.
The difference is that -4 was built with gcc-4.5.1-6 and -5 with gcc-4.6.0-0.12.
I wonder if gcc is acting up here.

Comment 3 Jakub Jelinek 2011-07-05 18:50:42 UTC
It is much more probable that this is an application bug (relying on undefined behavior, etc.) than a gcc bug, though of course the latter can't be ruled out.
If you suspect a miscompilation, first let somebody familiar with the source code check whether some gcc flags make it work again (e.g. whether compiling with -O0, or with -O2 -fno-strict-aliasing, makes it work).  If so, do a binary search between objects compiled with the working options and those compiled with the non-working options to narrow the problem down to one source file.  If not, do a similar binary search between objects compiled with the older compiler and objects compiled with the new one.
Once you know which source file is problematic, first compile it with -W -Wall and look at all the warnings to see whether any of them point at a problem in the code. If not, try to narrow the problem down within that file (and see whether it can still be reproduced with -fno-inline, which helps to narrow it down to a function), then try to create a self-contained testcase that calls that function with the right arguments and aborts, or signals in some other way, if the function misbehaves.

Comment 4 Dave Malcolm 2011-07-05 19:13:06 UTC
Here's the start of the definition of that function:

static PyObject *
call_function(PyObject ***pp_stack, int oparg
#ifdef WITH_TSC
                , uint64* pintr0, uint64* pintr1
#endif
                )

It could be that the WITH_TSC handling is confused, perhaps, but there is this forward declaration:
#ifdef WITH_TSC
static PyObject * call_function(PyObject ***, int, uint64*, uint64*);
#else
static PyObject * call_function(PyObject ***, int);
#endif

and it appears to be used consistently throughout.

I don't know if this is significant, but frame #0 in that backtrace is reported with the arguments in reverse order to those of the declaration.

Here's the frame from the backtrace:

#0  call_function (pintr1=0xfffcf8b6be0, pintr0=0xfffcf8b6bd8, oparg=<optimized out>, pp_stack=0xfffcf8b6be8)
    at /builddir/build/BUILD/Python-2.7.1/Python/ceval.c:4105

Every other frame appears to be reported with the arguments in the same order as in the declaration; this one is reported in reverse order.

Comment 8 Dave Malcolm 2011-08-23 16:41:37 UTC
If I disable line 56 within:
  ppc_getcounter(uint64 *v)
in Python/ceval.c, then the problem goes away:
    32  typedef unsigned long long uint64;
    33  
    34  /* PowerPC support.
    35     "__ppc__" appears to be the preprocessor definition to detect on OS X, whereas
    36     "__powerpc__" appears to be the correct one for Linux with GCC
    37  */
    38  #if defined(__ppc__) || defined (__powerpc__)
    39  
    40  #define READ_TIMESTAMP(var) ppc_getcounter(&var)
    41  
    42  static void
    43  ppc_getcounter(uint64 *v)
    44  {
    45      register unsigned long tbu, tb, tbu2;
    46  
    47    loop:
    48      asm volatile ("mftbu %0" : "=r" (tbu) );
    49      asm volatile ("mftb  %0" : "=r" (tb)  );
    50      asm volatile ("mftbu %0" : "=r" (tbu2));
    51      if (__builtin_expect(tbu != tbu2, 0)) goto loop;
    52  
    53      /* The slightly peculiar way of writing the next lines is
    54         compiled better by GCC than any other way I tried. */
    55      ((long*)(v))[0] = tbu;
    56      /*((long*)(v))[1] = tb; */ /* <==== this is the bug */
    57  }
    58  
    59  #elif defined(__i386__)

(gdb) p sizeof(long)
$44 = 8
(gdb) p sizeof(uint64)
$45 = 8

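Looks like lines 55 and 56 erroneously assume that a long is 4 bytes on this arch. Here sizeof(long) is 8, so line 55 already fills the whole 8-byte uint64, and line 56 writes a further 8 bytes past the end of *v, trashing the next value on the machine's stack.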
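
To illustrate the overrun, here is a minimal, portable sketch (not code from ceval.c). It models the 8-byte uint64 slot and the 8 bytes that follow it as a byte array, and replays only the footprint of the two stores at lines 55 and 56 for a given sizeof(long). The names neighbour_survives, SLOT_SIZE and FRAME_SIZE, and the 16-byte layout, are hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: an 8-byte counter slot (the uint64 that v points
   to) followed directly by 8 bytes belonging to some other value. */
enum { SLOT_SIZE = 8, FRAME_SIZE = 16 };

/* Replay the footprint of
       ((long*)(v))[0] = tbu;
       ((long*)(v))[1] = tb;
   for a given sizeof(long).  Returns 1 if the neighbouring 8 bytes are
   untouched afterwards, 0 if they were overwritten. */
int
neighbour_survives(size_t word_size)
{
    unsigned char frame[FRAME_SIZE];
    unsigned char expected[SLOT_SIZE];

    /* Both stores must stay inside the modelled frame. */
    assert(word_size > 0 && 2 * word_size <= FRAME_SIZE);

    memset(frame, 0x11, sizeof frame);        /* known initial pattern */
    memset(expected, 0x11, sizeof expected);

    memset(frame, 0xAA, word_size);             /* store to index [0] */
    memset(frame + word_size, 0xBB, word_size); /* store to index [1] */

    return memcmp(frame + SLOT_SIZE, expected, SLOT_SIZE) == 0;
}
```

In this model, a 4-byte long keeps both stores inside the 8-byte slot, while an 8-byte long (as gdb reports above) puts the second store entirely on the neighbouring value.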

The code has been this way since ppc_getcounter was added, in:
  http://hg.python.org/cpython/rev/f455bbe7ea7e

I may have broken this in:
  http://hg.python.org/cpython/rev/419ca089d365/
which was for:
  http://bugs.python.org/issue10655
by (perhaps) generalizing support from ppc to (ppc and ppc64) (not sure about this).

Comment 10 Dave Malcolm 2011-08-23 18:27:28 UTC
Workaround for now is to stop using "--with-tsc" when configuring the debug build on ppc64.

Fix committed to "python" in rawhide (for f17):
http://pkgs.fedoraproject.org/gitweb/?p=python.git;a=commitdiff;h=76e85fb7737abc82d729292607f9e2759645e29c
  Building python-2.7.2-6.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296206

Fix committed to "python3" in rawhide (for f17):
http://pkgs.fedoraproject.org/gitweb/?p=python3.git;a=commitdiff;h=4763ff864f559286fdcf5090d30db55311119ecb
  Building python3-3.2.1-4.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296208

Comment 13 Dave Malcolm 2011-08-23 19:51:32 UTC
Created attachment 519514
Candidate patch to fix --with-tsc on ppc64, and to fix aliasing violations on 32-bit ppc

Tested and seems to work on ppc64; am about to test on 32-bit ppc
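
The attached patch is not reproduced in this report. As a hedged sketch of one aliasing-free way to build the 64-bit value, assuming the upper and lower halves of the time base are each taken as 32-bit quantities, the halves can be combined with a shift and an OR and stored through the 64-bit pointer itself. The names combine_timebase and store_counter are hypothetical, and the actual patch may well take a different approach.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: join the upper and lower 32-bit halves of the
   time base into one 64-bit value, independent of sizeof(long). */
uint64_t
combine_timebase(uint32_t tbu, uint32_t tb)
{
    return ((uint64_t)tbu << 32) | (uint64_t)tb;
}

/* Hypothetical helper: store through the uint64_t pointer itself, so
   exactly 8 bytes are written and no cast to long* is involved. */
void
store_counter(uint64_t *v, uint32_t tbu, uint32_t tb)
{
    assert(v != NULL);
    *v = combine_timebase(tbu, tb);
}
```

Because the store goes through a uint64_t lvalue, it writes the same 8 bytes on 32-bit and 64-bit targets and does not depend on -fno-strict-aliasing.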

Comment 14 Dave Malcolm 2011-08-23 20:55:41 UTC
I've applied the patch from attachment #519514 to both python and python3 in rawhide, and re-enabled --with-tsc on ppc64 for the debug build; rebuilding both now:

python:
http://pkgs.fedoraproject.org/gitweb/?p=python.git;a=commitdiff;h=92ed49e1f9a286b6ee791a29f6b25be191d0c4c5
  Building python-2.7.2-7.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296752

python3:
http://pkgs.fedoraproject.org/gitweb/?p=python3.git;a=commitdiff;h=ceb359a69b285160f7997c0b77de1dfd3567e80e
  Building python3-3.2.1-5.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296765

Comment 15 Dave Malcolm 2011-08-23 21:09:34 UTC
For Fedora 16, let's simply disable --with-tsc on ppc64 debug

python (f16)
http://pkgs.fedoraproject.org/gitweb/?p=python.git;a=commitdiff;h=0be4d5a7fc2fbfd4e558e7143abef04cb580c4b9
  Building python-2.7.2-4.1.fc16 for f16-candidate
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296788

python3 (f16)
http://pkgs.fedoraproject.org/gitweb/?p=python3.git;a=commitdiff;h=0c0fcb4642f6d6b385b95231d868d676081e6299
  Building python3-3.2.1-2.1.fc16 for f16-candidate
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296797

Comment 16 Dave Malcolm 2011-08-23 21:13:11 UTC
Test case:
  enable tscdump, and run some bytecodes (e.g. by "import logging")

$ python-debug -c "import sys; sys.settscdump(True); import logging"
$ python3-debug -c "import sys; sys.settscdump(True); import logging"

Notes on --with-tsc:
  http://hg.python.org/cpython/file/f455bbe7ea7e/Misc/SpecialBuilds.txt

Comment 18 Dave Malcolm 2011-08-23 21:57:56 UTC
[All of the builds succeeded]

Do I need to do a Bodhi update to F16 to pull the fix in for ppc64, or is this unneeded?

Comment 19 Karsten Hopp 2011-08-24 12:25:50 UTC
No, a Bodhi update is not needed. I can have a different n-v-r on PPC than on the primary archs, although I try to avoid that if possible. The only requirements are that the patch is in git, so that the next python update will work out of the box on PPC, and that the next n-v-r on the primary archs is higher than what we have on PPC. Both requirements are met, so I can just pull in the new package.

Unfortunately python and python3 still fail to build in koji, although the builds progressed beyond the segfault issue.
I've opened bugzilla 732998 to track the new problem.

Comment 20 Dave Malcolm 2011-08-31 21:52:41 UTC
--with-tsc bug and patch reported upstream as http://bugs.python.org/issue12872

