Bug 138683 - gcc miscompiles numerical code (lapack)
Summary: gcc miscompiles numerical code (lapack)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: gcc
Version: 3
Hardware: athlon
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jakub Jelinek
QA Contact:
URL:
Whiteboard:
Duplicates: 138685
Depends On:
Blocks:
 
Reported: 2004-11-10 18:02 UTC by Dmitri A. Sergatskov
Modified: 2007-11-30 22:10 UTC (History)

Fixed In Version: 3.0-26.fc3
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-10-30 22:13:02 UTC
Type: ---
Embargoed:


Attachments
lapack file causing the problem (3.61 KB, application/octet-stream)
2004-11-11 06:55 UTC, Dmitri A. Sergatskov
Simple test code which causes LAPACK to hang (1.07 KB, text/plain)
2005-01-14 01:19 UTC, Fernando Perez

Description Dmitri A. Sergatskov 2004-11-10 18:02:56 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.7.3) Gecko/20041020
Firefox/0.10.1

Description of problem:
It appears that gcc in FC3 miscompiles numerical code. 
The problem appears to be with lapack libraries and can be
demonstrated with octave (which uses them):

[dima@localhost ~]$ octave
GNU Octave, version 2.1.57 (i686-pc-linux-gnu).
Copyright (C) 2004 John W. Eaton.
....

octave:1> a=rand(100);
octave:2> tic; eig(a); toc
error: dgeev failed to converge
octave:2>
---------------
Sometimes it just hangs there for a few minutes, after which I kill it.
I tried compiling octave myself against ATLAS (a different, optimized
blas/lapack implementation) libraries, which I also compiled myself.
The result was the same. I also tried different versions of octave.

This all works on RHEL3, RH9, FC2, FC1.

It is possible that the problem is actually with glibc. I was not able
to recompile octave with gcc33 to check that.


Version-Release number of selected component (if applicable):
gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)

How reproducible:
Always

Steps to Reproduce:
1. start octave
2. type at the octave prompt as shown
3.
    

Actual Results:  octave hangs or gives an error 

Expected Results:  "0.004" (may vary slightly) -- this is the 4 msec that
it took the code to run on FC2.

Additional info:

this is on athlon/xp 2000MHz/500Meg.

Comment 1 Jakub Jelinek 2004-11-10 22:39:28 UTC
*** Bug 138685 has been marked as a duplicate of this bug. ***

Comment 2 Dmitri A. Sergatskov 2004-11-11 02:58:58 UTC
It is me again. Recompiling both lapack and octave with FFLAGS="-O
-ffloat-store" seems to solve this problem.

Comment 3 Jakub Jelinek 2004-11-11 05:44:22 UTC
Then it is IMHO not a GCC bug.  -ffloat-store is not the default on purpose, it is too slow and most of the software out there doesn't need it.

IMHO you want to open a bug against lapack (and/or octave) and request that
it be compiled in two versions on IA-32: -ffloat-store and -mfpmath=sse -msse2
(the latter for P4 & recent AMD CPUs).

Comment 4 Dmitri A. Sergatskov 2004-11-11 06:19:13 UTC
The issue here is that lapack has been around for a while and has been
compiled by previous generations of gcc, as well as a bunch of other
compilers, just fine. Suddenly it breaks, which makes me think that
the default compiler options are not "safe." When a clock chimes 13
times, it is not the 13th chime that is broken; it is the clock that
needs fixing. I did catch lapack, but who knows what else might be
broken?
As for the speed part -- I heard rumors that -ffloat-store is
slow, but in fact all my benchmarks now run at the same speed as
always (and some that involve iterations, like Schur decomposition,
run about 20% faster because of the faster convergence).
Anyway, I did not think of -ffloat-store as a fix, but rather as a
workaround for a potentially more serious problem with gcc.


Comment 5 Jakub Jelinek 2004-11-11 06:36:57 UTC
Beg to disagree.  For code that relies on computation not being done with extra
precision -ffloat-store is a must and lapack clearly relies on it.
Why things worked in this particular case with < GCC 3.4 and don't work anymore
is most probably because GCC is now better at optimizing and likely produces fewer spills to memory (and on the mis-designed i387 FPU, only spills to memory
round to the declared precision instead of using full long double precision).

If you give me one exact routine in lapack that causes the problems you are seeing, I can look at it in detail and tell exactly what is going on.
But I certainly don't intend to debug half of octave/lapack to figure it out.

Comment 6 Dmitri A. Sergatskov 2004-11-11 06:55:56 UTC
Created attachment 106477
lapack file causing the problem

Comment 7 Jakub Jelinek 2004-11-11 07:01:04 UTC
Can you please also specify what exact arguments you are calling that routine with?  Thanks.

Comment 8 Dmitri A. Sergatskov 2004-11-11 07:04:44 UTC
All my comments are gone -- I will try again.

I understand that the problem is (most likely) due to aggressive
optimization. My concern was that it is _too_ aggressive and produces
wrong code. 

The error which I posted was in the file dgeev.f (from lapack).
I attached the file for reference.
I am going to write to the octave mailing list about it. Perhaps it would
be good to involve the lapack people as well. I did file this as a bug
against lapack:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=138791

Thanks for your attention and a fast response.
Sincerely,

Dmitri. 

Comment 9 Dmitri A. Sergatskov 2004-11-11 07:10:57 UTC
Since I do not call it directly (I use octave, which calls this 
lapack routine), it is hard for me to tell you how it is called
exactly. I am going to write to Octave mailing list and perhaps 
John Eaton (Octave author) can give you an authoritative answer.

Sincerely,

Dmitri.
 

Comment 10 Fernando Perez 2005-01-14 01:19:41 UTC
Created attachment 109763
Simple test code which causes LAPACK to hang

Compile with 
f77 test_svd.f -llapack -o test_svd_fedora

Comment 11 Fernando Perez 2005-01-14 01:22:21 UTC
I just created an attachment (sorry that I didn't put all the comments
there, I thought it would bring me back here).

The above line will produce a binary linked against the dynamic
lapack/blas in FC3:

planck[libmwrep]> ldd test_svd_fedora
        liblapack.so.3 => /usr/lib/liblapack.so.3 (0x003d6000)
        libg2c.so.0 => /usr/lib/libg2c.so.0 (0x00db6000)
        libm.so.6 => /lib/tls/libm.so.6 (0x00d30000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x002f9000)
        libc.so.6 => /lib/tls/libc.so.6 (0x00c04000)
        libblas.so.3 => /usr/lib/libblas.so.3 (0x00101000)
        /lib/ld-linux.so.2 (0x00beb000)

If I run this exact same binary on a RedHat9.0 box, it runs fine:

kellogg[libmwrep]> ./test_svd_atlas
 Entering dgesvd. If this takes more than a second or two
 it means it has hanged.  Kill it with Ctrl-C
 dgesvd finished
 svals:
  4.  8.32667268E-17  4.67733824E-51  1.52114348E-84

However, on a Fedora3 machine it hangs forever, with 100% cpu utilization.

I hope this can be fixed with a new lapack release which is correctly
compiled soon.

Many thanks in advance,

f.

Comment 12 Dmitri A. Sergatskov 2005-01-14 01:30:56 UTC
You can get lapack from rawhide (development tree) which does not have
this problem. I personally added -ffloat-store to the FFLAGS in
src.rpm and rebuild an rpm file. Works fine:

[dima@tumbleweed bug]$ ./test_svd_fedora
 Entering dgesvd. If this takes more than a second
 it means it has hanged.  Kill it with Ctrl-C
 dgesvd finished
 svals:
  4.  8.32667268E-17  4.67733824E-51  1.52114348E-84

I still do not understand why the fix is not in the update tree...

Dmitri.

Comment 13 Dmitri A. Sergatskov 2005-01-14 01:36:40 UTC
BTW, the bug for lapack is:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=138447


Comment 14 Fernando Perez 2005-01-14 01:47:16 UTC
Dmitri, thanks for the lapack bug number.  Unfortunately it is not true
that the problem is solved, as is claimed in that bug.

I hope they reopen it and actually fix it.

Cheers,

f

Comment 15 Fernando Perez 2005-01-14 01:58:52 UTC
OK, I just fished out the -28 blas/lapack RPMs out of a 'development'
fedora repo, and slapped them onto my FC3 box.  The problem is indeed
solved by them.

But they should be backported to FC3-updates, please.  These corrected
packages have been available since 12/21, and I would not have wasted
a whole day tracking this (which I thought was a bug in my code) if
only they had been made available to updates.  Yum would have nicely
picked them up right away.

Please put these corrected -28 blas/lapack packages on the released
updates repos for FC3.

And many thanks to Dmitri for your help!

Comment 16 Dmitri A. Sergatskov 2005-01-25 04:15:31 UTC
I wonder if this is related to:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=5900



Comment 17 Pawel Salek 2005-01-28 13:11:39 UTC
See https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=146447 for
analysis and a trivial test program. When lapack is determining hardware
accuracy, it jumps through hoops and calls the trivial DLAMC3 to
enforce memory storage instead of just doing the addition:
      DOUBLE PRECISION FUNCTION DLAMC3( A, B )
      DOUBLE PRECISION   A, B
      DLAMC3 = A + B
      RETURN
      END
What is interesting: compiling the test code from 146447 with the Intel
Fortran compiler never exposes the problem, even when compiled with the
most aggressive optimization options. gcc generates the following code:

gcc:
        pushl   %ebp
        movl    %esp, %ebp
        movl    12(%ebp), %edx
        fldl    (%edx)
        movl    8(%ebp), %eax
        faddl   (%eax)
        leave
        ret

while ifort:
dlamc3_:
# parameter 1: %eax
# parameter 2: %edx
..B2.1:                         # Preds ..B2.0
        movl      4(%esp), %eax                                 #187.32
        movl      8(%esp), %edx                                 #187.32
        .globl   dlamc3_.
dlamc3_.:                                                       #
        subl      $8, %esp                                      #187.32
        movsd     (%eax), %xmm0                                 #215.6
        addsd     (%edx), %xmm0                                 #215.6
        movsd     %xmm0, (%esp)                                 #215.6
        fldl      (%esp)                                        #215.6
        addl      $8, %esp                                      #221.6
        ret                                                     #221.6

Perhaps the parameters are passed differently?

Comment 18 Matthew Miller 2006-07-10 21:35:01 UTC
Fedora Core 3 is now maintained by the Fedora Legacy project for security
updates only. If this problem is a security issue, please reopen and
reassign to the Fedora Legacy product. If it is not a security issue and
hasn't been resolved in the current FC5 updates or in the FC6 test
release, reopen and change the version to match.

Thank you!


Comment 19 John Thacker 2006-10-30 22:13:02 UTC
This did get pushed to FC3 as an update as of lapack-3.0-26.fc3.  It's fixed in
FC4 and forwards.  Since FC3 is Fedora Legacy for security updates only, and it
was fixed as an update, closing the bug.

