166657 – Stack smashing detected on ppc64

Bug 166657 - Stack smashing detected on ppc64

Summary: Stack smashing detected on ppc64

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	gcc
Sub Component:
Version:	rawhide
Hardware:	powerpc
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Tom Tromey
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2005-08-24 13:43 UTC by Gary Benson
Modified:	2014-08-11 05:46 UTC (History)
CC List:	5 users (show)
Fixed In Version:	4.0.2-1
Clone Of:
Environment:
Last Closed:	2005-09-30 07:46:28 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Backtrace (33.08 KB, text/plain) 2005-08-24 15:05 UTC, Gary Benson	no flags	Details
ClassFileStruct.class (4.46 KB, application/octet-stream) 2005-08-25 16:29 UTC, Gary Benson	no flags	Details
patch which should fix this but I have not tested it yet (759 bytes, patch) 2005-09-22 22:25 UTC, Andrew Pinski	no flags	Details \| Diff
tested patch (2.24 KB, patch) 2005-09-22 22:30 UTC, Tom Tromey	no flags	Details \| Diff
patch with test case (3.47 KB, patch) 2005-09-23 16:04 UTC, Tom Tromey	no flags	Details \| Diff
Show Obsolete (2) View All

Description Gary Benson 2005-08-24 13:43:42 UTC

Description of problem:
Tomcat fails to build on ppc64, aborting with the message '*** stack smashing
detected ***: /usr/lib/jvm/java/bin/java terminated'.

Version-Release number of selected component (if applicable):
libgcj-4.0.1-10

How reproducible:
Always

Steps to Reproduce:
1. rpmbuild --rebuild tomcat5-5.0.30-8jpp_2fc.src.rpm
  
Actual results:
*** stack smashing detected ***: /usr/lib/jvm/java/bin/java terminated

Expected results:
Not that ;)

Comment 1 Gary Benson 2005-08-24 15:05:03 UTC

Created attachment 118070 [details]
Backtrace

Comment 2 Jakub Jelinek 2005-08-25 06:20:51 UTC

_Jv_InterpMethod::run (void *retp, ffi_raw *args, _Jv_InterpMethod *meth)
uses at least 2 big VLAs (stack and locals), guess one of them is overflown.

Comment 3 Gary Benson 2005-08-25 09:11:35 UTC

This problem looks to have been exposed because everything is running as
bytecode on ppc64.  Seems that gnu.gcj.precompiled.db.path is pointing into
/usr/lib and not /usr/lib64 (it is on x86_64 as well).  I forced it, to see if
the stack smash went away, but that started throwing NPEs with broken
stacktraces :-/

Comment 4 Tom Tromey 2005-08-25 15:52:01 UTC

First, I wonder if this could be:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23404

(Though in the end that PR says it is for ppc and not ppc64..)


Second, Gary, is there any way you could find out exactly what method
is provoking the stack trashing and then send me the class file?  That
might help debug the interpreter.

Comment 5 Tom Tromey 2005-08-25 16:27:34 UTC

BTW -- have we tried running the libffi test suite built with the
stack-smashing detector in place?
If it is a libffi bug, perhaps that would trigger the stack smasher.

Comment 6 Gary Benson 2005-08-25 16:29:09 UTC

Created attachment 118114 [details]
ClassFileStruct.class

Here's the class; the method is "double doubleAt(int)".

Comment 7 Jakub Jelinek 2005-08-25 16:35:56 UTC

Yes, all of rawhide gcc is built with -fstack-protector and make check is run
including libffi.  There were no regressions in libffi testsuite on any
architecture between 4.0.1-6 and 4.0.1-8 (the former one was before
-fstack-protector has been added to FC5 RPM_OPT_FLAGS, the latter after it).
There were a bunch of libffi x86 regressions in 4.0.1-7 because of the change,
but those have been fixed before 4.0.1-8.

Comment 8 Tom Tromey 2005-08-25 16:57:24 UTC

Thanks for the class file.
I looked through the interpreter a bit; some of the stack and locals
handling on 64 bit platforms looks a bit fishy to me.
Could you give this test case a try?  It is basically just the
same failing method with a small wrapper.

public class pr166657 {
  public long toLong(int x) { return (long) x; }
  public double doubleAt(int x) {
    return Double.longBitsToDouble(toLong(x));
  }
  public static void main(String[] args) {
    pr166657 p = new pr166657();
    System.out.println(p.doubleAt(5));
  }
}

Comment 9 Gary Benson 2005-08-25 20:49:41 UTC

Hmmm, fishy indeed: i386, ppc and s390 all printed out 2.5E-323, but:
 - ppc64 gave the stack smashing warning
 - x86_64 sat there consuming 100% cpu
 - s390x thought for a bit then segfaulted
 - ia64 printed out 4.440892098500626E-15

Comment 10 Tom Tromey 2005-08-26 19:36:48 UTC

I think the problem is that we're passing 2 stack slots
to the ffi call on a 64 bit arch, but a double or long
only actually takes a single stack slot.
BTW I probably can't work on this further until next Thursday :-(

Comment 11 Tom Tromey 2005-09-09 19:36:33 UTC

On x86-64 the hang seems to be in mprec.c:

#0  _Jv_lshift (ptr=0x7fffffc94bc0, b=0x7fffffc957e8, k=1)
    at ../../../../../../gcc/libjava/classpath/native/fdlibm/mprec.c:463
#1  0x00002aaaabc1fc68 in _Jv_dtoa_r (ptr=0x7fffffc94bc0, _d=Variable "_d" is
not available.
)
    at ../../../../../../gcc/libjava/classpath/native/fdlibm/dtoa.c:805
#2  0x00002aaaabc1fe48 in _Jv_dtoa (_d=2.4703282292062327e-323, mode=0,
    ndigits=20, decpt=0x7fffffc95e7c, sign=0x7fffffc95e78, rve=0x0,
    buf=0x7fffffc95e10 "\uffff^\uffff\uffff\uffff\177", float_type=0)
    at ../../../../../../gcc/libjava/classpath/native/fdlibm/dtoa.c:902
#3  0x00002aaaab826ae4 in java::lang::Double::toString (
    value=2.4703282292062327e-323, isFloat=Variable "isFloat" is not available.
)
    at ../../../gcc/libjava/java/lang/natDouble.cc:79


462       for (i = b->_maxwds; n1 > i; i <<= 1)
(gdb) p b._maxwds
$6 = 0
(gdb) p n1
$7 = 1
(gdb) p i
$8 = 0


This isn't helpful to the main bug but does indicate that something
went wrong somewhere else.

Comment 12 Tom Tromey 2005-09-09 19:38:44 UTC

Also, fwiw, on x86-64 the computed value does seem to be correct:

(gdb) up 3
#3  0x00002aaaab826ae4 in java::lang::Double::toString (
    value=2.4703282292062327e-323, isFloat=Variable "isFloat" is not available.
)
    at ../../../gcc/libjava/java/lang/natDouble.cc:79
79        _dtoa (value, 0, 20, &decpt, &sign, NULL, buffer, (int)isFloat);

Comment 13 Tom Tromey 2005-09-09 19:45:10 UTC

I'm not so sure about comment #10 any more.
I think the "java" api in libffi/java_raw_api.c tries to handle
this case correctly, by skipping the second dummy value:

#if FFI_SIZEOF_ARG == 8	  
	case FFI_TYPE_UINT64:
	case FFI_TYPE_SINT64:
	case FFI_TYPE_DOUBLE:
	  *args = (void *)raw;
	  raw += 2;
	  break;
#endif

Comment 14 Gary Benson 2005-09-12 09:26:37 UTC

Doesn't that skip the second slot only when it doesn't need to?

Comment 15 Tom Tromey 2005-09-13 15:32:24 UTC

Nope, that is correct.

Conceptually each java stack slot is 32 bits.
On a 32 bit machine, long and double value are split over 2 slots.
And since this is somewhat visible at the virtual machine level
(eg, dup won't duplicate a long) it is pretty much forced on the
interpreter.

So, on a 64 bit machine we store a 64 bit value in a single slot
and just leave the second slot empty (so that we don't have to track
types of things all over the interpreter).

What the java api code is doing here is skipping this unused slot
that appears in the arguments.

Comment 16 Gary Benson 2005-09-13 15:35:48 UTC

Ah, ok.

Comment 17 Tom Tromey 2005-09-13 15:57:48 UTC

I built gcc4.0 cvs head on a RHEL 3 ia64 machine.
gdb doesn't work reliably on this platform so I
instrumented the source with printfs to see what was
going on.  Here is the end of the output, showing the
call to println:

  calling println ; 3 args
    println: 2.47033e-323
  ffi_java_raw_call: allocating 16
  ffi_java_raw_to_ptrarray: orig = 0x60000fffffffab30
    i = 0, raw after = 0x60000fffffffab38
    double arg = 2.47033e-323
    i = 1, raw after = 0x60000fffffffab48
  PrintStream: 4.44089e-15

So, the value is correct in the interpreter and correct
in ffi_java_raw_to_ptrarray.  Then when it arrives in
println it is wrong.

The PrintStream instrumentation just calls a new native
function to print the double; this native function uses
printf underneath.

I think one of two things is happening.

First, it could be that libffi is internally consistent
but it disagrees with the compiler as to how doubles are
passed.

Another possibility would be that there is a bug in
how double arguments are passed to CNI methods.

However, note that with this test case the first
value is printed correctly but the second is not:

public class t {
  public static void main(String[] args) {
    System.out.println(5.0);
    System.out.println(Double.longBitsToDouble(5));
  }
}

I don't have a theory that covers that.

At this point I think it would be more efficient
for someone more knowledgeable about the ia64 to
look at this.

Comment 18 Tom Tromey 2005-09-13 16:45:29 UTC

On x86-64 the immediate problem is that the fdlibm bignum
code overflows an internal buffer, overwriting a different
bignum and causing an infinite loop.

A wordaround is to change the definition of MAX_BIGNUM_WDS
in mprec.h like so:

#define MAX_BIGNUM_WDS 64

I'm not yet sure why this overflow occurs.

Comment 19 Tom Tromey 2005-09-13 17:07:07 UTC

Actually ... I enabled the assertion code and
to avoid an abort you must use 128, not 64.

Comment 20 Tom Tromey 2005-09-13 18:11:27 UTC

Possible workaround for x86-64 problem:

http://gcc.gnu.org/ml/java-patches/2005-q3/msg00354.html

Comment 21 Tom Tromey 2005-09-13 21:43:18 UTC

Applying the fix mentioned above seems to fix the problem
on PPC64 as well.  At least, before the fix I get a crash,
and afterward I do not.

I have not looked at the s390x failure.

Comment 22 Gary Benson 2005-09-14 08:43:27 UTC

The ppc64 failure is the only one that's causing me problems.  For some reason
the other failures do not happen in beehive.

Comment 23 Tom Tromey 2005-09-14 17:22:33 UTC

Well, if we think it is ok to allocate a 16k buffer on the stack,
then the patch in comment #20 should go in.

Comment 24 Gary Benson 2005-09-15 08:49:52 UTC

I guess the fact that you're reticent means it's not ok.  How big is the stack
anyway, and what happens when you run out?

From looking at it the change affects only the size of _Jv_Bigint, which in turn
only seems to be used when converting strings to doubles and vice versa.  If it
only affects this, and only on 64 bit platforms, and if the failure mode of a
stack overflow is meaningful, then it might be worth a try.

Comment 25 Tom Tromey 2005-09-15 18:26:11 UTC

As discussed internally, this patch is probably fine.
gcc's 4.0.x branch is frozen atm, but we could put this into
the RH gcc.  Jakub...?

Comment 26 Gary Benson 2005-09-20 08:59:44 UTC

The stack smashing is still happening with 4.0.1-14 :-/

Comment 27 Tom Tromey 2005-09-22 15:08:33 UTC

I spent some more time debugging this.
FWIW the earlier patch does fix a real problem, just not the same
problem we're seeing on PPC64; I suspect it "started working" for
me because I was using the wrong compiler flags.

With my current build I can definitely reproduce the stack smash.
Using -O0 when compiling the interpreter works around it.

Currently I suspect that the smashing actually occurs in libffi.
However, I am not 100% certain.

If you look at libjava/interpret.cc you can see the FFI call in the
interpreter:

	_Jv_value rvalue;

#if FFI_NATIVE_RAW_API
	/* We assume that this is only implemented if it's correct	*/
	/* to use it here.  On a 64 bit machine, it never is.		*/
	ffi_raw_call (cif, fun, (void*)&rvalue, raw);
#else
	ffi_java_raw_call (cif, fun, (void*)&rvalue, raw);
#endif

If I change rvalue to be declared as '_Jv_value rvalue[2]' (and the
uses to be &rvalue[0]), then the problem goes away.

Comment 28 Tom Tromey 2005-09-22 17:29:19 UTC

With the above change I set a watchpoint on rvalue[1].
It triggers in libffi:

.Lfp_return_value:
	bf	28, .Lfloat_return_value
	stfd	%f1, 0(%r30)
	stfd	%f2, 8(%r30)	/* It might be a long double */
-->	b	.Ldone_return_value

I don't know anything about ppc assembly, but that second
stfd looks like the culprit.

Removing this line and reverting to having rvalue be a 1-element
array removes the stack smash.  (Adding the line back and keeping
rvalue as a 1-element array shows the smash again...)

Jakub or Andrew, could one of you suggest a real fix?
I'll leave my build around; I'd be happy to test it.

Comment 29 Andrew Pinski 2005-09-22 22:25:56 UTC

Created attachment 119168 [details]
patch which should fix this but I have not tested it yet

Comment 30 Tom Tromey 2005-09-22 22:30:13 UTC

Created attachment 119169 [details]
tested patch

I wrote this patch by talking to Andrew Pinskia and Andreas Tobler
on irc and by reading the Darwin code, which apparently handles this
correctly.  The problem is that the Linux64 code was not distinguishing
between double and long double returns.

Andrew, Jakub -- comments?  If it looks ok to you I'll submit to gcc.

Comment 31 Jakub Jelinek 2005-09-22 23:04:10 UTC

Two things:
1) FFI_TYPE_STRUCT case falls through into the FFI_TYPE_LONGDOUBLE case, so
I guess you need to move it elsewhere and goto, or test for type ==
FFI_TYPE_LONGDOUBLE in the if condition as well
2) is the
        bf      28, .Lfloat_return_value
+       bt      27, .Lfd_return_value
order correct?  I mean, won't with FFI_TYPE_LONGDOUBLE bit 28 be 0 and 27 1?
Then this would return the long double using .Lfloat_return_value, rather than
.Lfd_return_value (shouldn't that label be .Lldouble_return_value instead btw?).
Neither of these is too much important ATM, as Linux uses for the time being
a 64-bit long double AFAIK (i.e. sizeof (double) == sizeof (long double))
and there is:
#if FFI_TYPE_LONGDOUBLE != FFI_TYPE_DOUBLE
      if (type == FFI_TYPE_LONGDOUBLE)
        type = FFI_TYPE_DOUBLE;
#endif
a few lines above it.  But gcc already has a switch to enable 128 bit long double
on ppc64-linux, so perhaps one day...

Comment 32 Tom Tromey 2005-09-22 23:14:09 UTC

Thanks for the review.
I'll write a better patch and a few test cases tomorrow.

Comment 33 Tom Tromey 2005-09-23 16:04:19 UTC

Created attachment 119184 [details]
patch with test case

How about this patch?
I fixed the problems you noted and also updated an existing test
case to check for this bug.

Comment 34 Tom Tromey 2005-09-26 19:57:29 UTC

This patch is in GCC cvs head.
It should go in the 4.0 branch as well, once that is not frozen.

Note You need to log in before you can comment on or make changes to this bug.