Bug 698726 - Crash on ppc64 within "--with-tsc" support (seen in python-debug during build)
Product: Fedora
Classification: Fedora
Component: python
Hardware: powerpc
OS: Unspecified
Priority: high
Severity: high
Assigned To: Dave Malcolm
QA Contact: Fedora Extras Quality Assurance
Blocks: F16Betappc
Reported: 2011-04-21 11:41 EDT by Karsten Hopp
Modified: 2012-03-14 09:10 EDT (History)
Doc Type: Bug Fix
Last Closed: 2011-10-27 11:55:55 EDT

Attachments
gdb backtrace (22.96 KB, text/plain)
2011-04-21 11:41 EDT, Karsten Hopp
Candidate patch to fix --with-tsc on ppc64, and to fix aliasing violations on 32-bit ppc (1.72 KB, patch)
2011-08-23 15:51 EDT, Dave Malcolm

External Trackers
Tracker ID Priority Status Summary Last Updated
Python 12872 None None None Never

Description Karsten Hopp 2011-04-21 11:41:49 EDT
Created attachment 493900
gdb backtrace

Description of problem:

gcc -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv   -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv    -Xlinker -export-dynamic -o python-debug \
		Modules/python.o \
		-L. -lpython2.7_d -lpthread -ldl  -lutil   -lm  
/bin/sh: line 1: 21015 Segmentation fault      (core dumped) LD_LIBRARY_PATH=/builddir/build/BUILD/Python-2.7.1/build/debug: CC='gcc -pthread' LDSHARED='gcc -pthread -shared ' OPT='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv' ./python-debug -E /builddir/build/BUILD/Python-2.7.1/setup.py -q build
RPM build errors:

Version-Release number of selected component (if applicable):
Comment 1 Karsten Hopp 2011-04-21 11:42:38 EDT
full logs at https://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=197370
Comment 2 Karsten Hopp 2011-07-05 11:50:46 EDT
a similar problem still exists in python-2.7.2-4.fc16 on ppc64:

/builddir/build/BUILD/Python-2.7.2/Modules/posixmodule.c:7317: warning: the use of `tempnam' is dangerous, better use `mkstemp'
gcc -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -Wl,-z,relro -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv   -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -Wl,-z,relro -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -Wl,-z,relro -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv    -Xlinker -export-dynamic -o python-debug \
		Modules/python.o \
		-L. -lpython2.7_d -lpthread -ldl  -lutil   -lm  
/bin/sh: line 1:  3599 Segmentation fault      (core dumped) LD_LIBRARY_PATH=/builddir/build/BUILD/Python-2.7.2/build/debug: CC='gcc -pthread' LDSHARED='gcc -pthread -shared ' OPT='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -Wl,-z,relro -m64 -mminimal-toc -D_GNU_SOURCE -fPIC -fwrapv' ./python-debug -E /builddir/build/BUILD/Python-2.7.2/setup.py -q build
make: *** [sharedmods] Error 139

full logs at http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=245909

The latest successful python build was python-2.7.1-4.fc15; python-2.7.1-5.fc15 already failed with this python-debug segmentation fault.
The difference is that -4 was built with gcc-4.5.1-6 and -5 with gcc-4.6.0-0.12.
I wonder if gcc is acting up here.
Comment 3 Jakub Jelinek 2011-07-05 14:50:42 EDT
It is much more probable that this is an application bug (relying on undefined behavior etc.) than a gcc bug, though of course the latter can't be ruled out.
If you suspect a miscompilation, let somebody familiar with the source code first check whether some gcc flags make it work again (e.g. whether compiling with -O0 makes it work, or -O2 -fno-strict-aliasing, etc.).  If yes, do a binary search between objects compiled with the working options and objects compiled with the non-working ones to narrow the problem down to one source file.  If not, do a similar binary search between objects compiled with the older compiler and objects compiled with the new compiler.
Once you know which source file is problematic, first compile it with -W -Wall and look at all the warnings to see whether any of them point to a problem in the code.  If not, try to narrow the problem down further (check whether it still reproduces with -fno-inline, which helps to narrow it down to a function), then try to create a self-contained testcase that calls that function with the right arguments and aborts, or signals in some other way, if the function misbehaves.
Comment 4 Dave Malcolm 2011-07-05 15:13:06 EDT
Here's the start of the definition of that function:

static PyObject *
call_function(PyObject ***pp_stack, int oparg
#ifdef WITH_TSC
                , uint64* pintr0, uint64* pintr1
#endif
                )

It could be that the WITH_TSC handling is confused, perhaps, but there is this forward declaration:
#ifdef WITH_TSC
static PyObject * call_function(PyObject ***, int, uint64*, uint64*);
#else
static PyObject * call_function(PyObject ***, int);
#endif

and it appears to be used consistently throughout.

I don't know if this is significant, but frame #0 in that backtrace is reported with the arguments in reverse order to those of the declaration.

Here's the frame from the backtrace:

#0  call_function (pintr1=0xfffcf8b6be0, pintr0=0xfffcf8b6bd8, oparg=<optimized out>, pp_stack=0xfffcf8b6be8)
    at /builddir/build/BUILD/Python-2.7.1/Python/ceval.c:4105

Every other frame appears to be reported with the arguments in the same order as in the declaration; this one is reported in reverse order.
Comment 8 Dave Malcolm 2011-08-23 12:41:37 EDT
If I disable line 56 within:
  ppc_getcounter(uint64 *v)
in Python/ceval.c, then the problem goes away:
    32  typedef unsigned long long uint64;
    34  /* PowerPC support.
    35     "__ppc__" appears to be the preprocessor definition to detect on OS X, whereas
    36     "__powerpc__" appears to be the correct one for Linux with GCC
    37  */
    38  #if defined(__ppc__) || defined (__powerpc__)
    40  #define READ_TIMESTAMP(var) ppc_getcounter(&var)
    42  static void
    43  ppc_getcounter(uint64 *v)
    44  {
    45      register unsigned long tbu, tb, tbu2;
    47    loop:
    48      asm volatile ("mftbu %0" : "=r" (tbu) );
    49      asm volatile ("mftb  %0" : "=r" (tb)  );
    50      asm volatile ("mftbu %0" : "=r" (tbu2));
    51      if (__builtin_expect(tbu != tbu2, 0)) goto loop;
    53      /* The slightly peculiar way of writing the next lines is
    54         compiled better by GCC than any other way I tried. */
    55      ((long*)(v))[0] = tbu;
    56      /*((long*)(v))[1] = tb; */ /* <==== this is the bug */
    57  }
    59  #elif defined(__i386__)

(gdb) p sizeof(long)
$44 = 8
(gdb) p sizeof(uint64)
$45 = 8

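It looks like lines 55 and 56 erroneously assume that a long is 4 bytes on this arch. Here a long is 8 bytes, the same size as uint64, so line 55 already fills the whole of *v and line 56 writes past its end, trashing the next value on the machine's stack.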
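
To make the overrun concrete, here is a minimal sketch that replays the two stores from lines 55 and 56 into a pair of adjacent 8-byte slots. The struct and the function name are hypothetical, made up for illustration; memcpy is used in place of the pointer casts so that the demo itself stays well defined. It only mimics the layout, it is not the code from ceval.c.

```c
#include <string.h>

typedef unsigned long long uint64;

/* Hypothetical stand-in for two adjacent stack slots:
   the timestamp and whatever happens to follow it. */
struct slots {
    uint64 stamp;
    uint64 neighbour;
};

/* Replays "((long*)v)[0] = tbu; ((long*)v)[1] = tb;" byte-wise.
   Returns 1 if the second store reached past `stamp` into `neighbour`. */
static int second_store_clobbers_neighbour(void)
{
    struct slots s;
    unsigned long tbu = 0x11111111UL, tb = 0x22222222UL;
    unsigned char *base = (unsigned char *)&s;

    s.stamp = 0;
    s.neighbour = 0xDEADBEEFULL;

    memcpy(base, &tbu, sizeof tbu);               /* ((long*)v)[0] = tbu */
    memcpy(base + sizeof(long), &tb, sizeof tb);  /* ((long*)v)[1] = tb  */

    return s.neighbour != 0xDEADBEEFULL;
}
```

Where long is 4 bytes both stores land inside `stamp` and the function returns 0; where long is 8 bytes (as gdb reports above) the second store lands on `neighbour` and it returns 1.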

The code has been this way since ppc_getcounter was added, in:

I may have broken this in:
which was for:
by (perhaps) generalizing support from ppc to (ppc and ppc64) (not sure about this).
Comment 10 Dave Malcolm 2011-08-23 14:27:28 EDT
Workaround for now is to stop using "--with-tsc" when configuring the debug build on ppc64.

Fix committed to "python" in rawhide (for f17):
  Building python-2.7.2-6.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296206

Fix committed to "python3" in rawhide (for f17):
  Building python3-3.2.1-4.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296208
Comment 13 Dave Malcolm 2011-08-23 15:51:32 EDT
Created attachment 519514
Candidate patch to fix --with-tsc on ppc64, and to fix aliasing violations on 32-bit ppc

Tested and seems to work on ppc64; am about to test on 32-bit ppc
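
The contents of the patch are not reproduced in this report. As a rough sketch only, one way to avoid both the size assumption and the pointer cast is to combine the two halves with shifts and assign the result to the uint64 directly. The helper name below is hypothetical, and the sketch assumes that tbu carries the upper 32 bits of the time base and that the low 32 bits of tb carry the lower half; it is not necessarily what the attached patch does.

```c
typedef unsigned long long uint64;

/* Hypothetical helper: build the 64-bit timestamp from the two register
   values without writing through a (long*) alias of the uint64.
   The masks discard any bits above bit 31, so the result does not
   depend on whether unsigned long is 4 or 8 bytes wide. */
static uint64 combine_timebase(unsigned long tbu, unsigned long tb)
{
    return ((uint64)(tbu & 0xffffffffUL) << 32)
         | (uint64)(tb & 0xffffffffUL);
}
```

Inside ppc_getcounter this would be used as `*v = combine_timebase(tbu, tb);`, which stores exactly sizeof(uint64) bytes regardless of sizeof(long).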
Comment 14 Dave Malcolm 2011-08-23 16:55:41 EDT
I've applied the patch from attachment #519514 to both python and python3 in rawhide, and re-enabled --with-tsc on ppc64 for the debug build; rebuilding both now:

  Building python-2.7.2-7.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296752

  Building python3-3.2.1-5.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296765
Comment 15 Dave Malcolm 2011-08-23 17:09:34 EDT
For Fedora 16, let's simply disable --with-tsc on ppc64 debug

python (f16)
  Building python-2.7.2-4.1.fc16 for f16-candidate
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296788

python3 (f16)
  Building python3-3.2.1-2.1.fc16 for f16-candidate
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3296797
Comment 16 Dave Malcolm 2011-08-23 17:13:11 EDT
Test case:
  enable tscdump, and run some bytecodes (e.g. by "import logging")

$ python-debug -c "import sys; sys.settscdump(True); import logging"
$ python3-debug -c "import sys; sys.settscdump(True); import logging"

Notes on --with-tsc:
Comment 18 Dave Malcolm 2011-08-23 17:57:56 EDT
[All of the builds succeeded]

Do I need to do a Bodhi update to F16 to pull the fix in for ppc64, or is this unneeded?
Comment 19 Karsten Hopp 2011-08-24 08:25:50 EDT
No, a Bodhi update is not needed. I can have a different n-v-r on PPC than on the primary archs, although I try to avoid it if possible. The only requirements are that the patch is in git, so that the next python update will work out of the box on PPC, and that the next n-v-r on the primary archs is higher than what we have on PPC. Both requirements are met and I can just pull in the new package.

Unfortunately python and python3 still fail to build in koji, although the builds progressed beyond the segfault issue.
I've opened bugzilla 732998 to track the new problem.
Comment 20 Dave Malcolm 2011-08-31 17:52:41 EDT
--with-tsc bug and patch reported upstream as http://bugs.python.org/issue12872
